08 Dec, 2016

4 commits

  • In __sanitizer_cov_trace_pc we use task_struct and fields within it, but
    as we haven't included <linux/sched.h>, it is not guaranteed to be
    defined. While we usually happen to acquire the definition through a
    transitive include, this is fragile (and hasn't been true in the past,
    causing issues with backports).

    Include <linux/sched.h> to avoid any fragility.

    [mark.rutland@arm.com: rewrote changelog]
    Link: http://lkml.kernel.org/r/1481007384-27529-1-git-send-email-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Mark Rutland
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • Pull scheduler fix from Ingo Molnar:
    "An autogroup nice level adjustment bug fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/autogroup: Fix 64-bit kernel nice level adjustment

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "A bogus warning fix, a counter width handling fix affecting certain
    machines, plus a oneliner hw-enablement patch for Knights Mill CPUs"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Remove invalid warning from list_update_cgroup_event()
    perf/x86: Fix full width counter, counter overflow
    perf/x86/intel: Enable C-state residency events for Knights Mill

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "Two rtmutex race fixes (which miraculously never triggered, that we
    know of), plus two lockdep printk formatting regression fixes"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lockdep: Fix report formatting
    locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
    locking/rtmutex: Prevent dequeue vs. unlock race
    locking/selftest: Fix output since KERN_CONT changes

    Linus Torvalds
     

06 Dec, 2016

2 commits

  • Since commit:

    4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")

    printk() requires KERN_CONT to continue log messages. Lots of printk()
    calls in lockdep.c and print_ip_sym() don't have it. As a result, lockdep
    reports are completely messed up.

    Add missing KERN_CONT and inline print_ip_sym() where necessary.

    Example of a messed up report:

    0-rc5+ #41 Not tainted
    -------------------------------------------------------
    syz-executor0/5036 is trying to acquire lock:
    (
    rtnl_mutex
    ){+.+.+.}
    , at:
    [] rtnl_lock+0x1c/0x20
    but task is already holding lock:
    (
    &net->packet.sklist_lock
    ){+.+...}
    , at:
    [] packet_diag_dump+0x1a6/0x1920
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3
    (
    &net->packet.sklist_lock
    +.+...}
    ...

    Without this patch all scripts that parse kernel bug reports are broken.
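
To illustrate why the report above falls apart, here is a minimal user-space sketch of the continuation rule. It is not kernel code: `model_log`, `model_printk()` and `model_print_lock()` are hypothetical names, and the model assumes only that a message without the continuation marker always starts a new line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a log buffer; not kernel code. */
struct model_log {
	char text[256];
	size_t len;
};

/* Append as many bytes as fit, keeping the buffer NUL-terminated. */
static void model_append(struct model_log *log, const char *s)
{
	size_t n = strlen(s);
	size_t room = sizeof(log->text) - 1 - log->len;

	if (n > room)
		n = room;
	memcpy(log->text + log->len, s, n);
	log->len += n;
	log->text[log->len] = '\0';
}

/*
 * Assumed rule: a message without the continuation marker starts a new
 * line, even if the previous message did not end with a newline.
 */
static void model_printk(struct model_log *log, int is_cont, const char *msg)
{
	assert(log != NULL && msg != NULL);
	if (!is_cont && log->len > 0 && log->text[log->len - 1] != '\n')
		model_append(log, "\n");
	model_append(log, msg);
}

/* Emit the four fragments of one lock description. */
static void model_print_lock(struct model_log *log, int use_cont)
{
	model_printk(log, 0, "(");
	model_printk(log, use_cont, "rtnl_mutex");
	model_printk(log, use_cont, "){+.+.+.}");
	model_printk(log, use_cont, ", at:\n");
}
```

Calling `model_print_lock()` on a zero-initialized `struct model_log` with `use_cont` set to 0 spreads the lock description over four lines, as in the broken report; with `use_cont` set to 1 the fragments stay on one line.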

    Signed-off-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andreyknvl@google.com
    Cc: aryabinin@virtuozzo.com
    Cc: joe@perches.com
    Cc: syzkaller@googlegroups.com
    Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     
  • The warning introduced in commit:

    864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

    assumed that a cgroup switch always precedes list_del_event. This is
    not the case. Remove warning.

    Make sure that cpuctx->cgrp is NULL until a cgroup event is sched in
    or ctx->nr_cgroups == 0.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Fenghua Yu
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Marcelo Tosatti
    Cc: Nilay Vaish
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Ravi V Shankar
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vikas Shivappa
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1480841177-27299-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     

03 Dec, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Lots more phydev and probe error path leaks in various drivers by
    Johan Hovold.

    2) Fix race in packet_set_ring(), from Philip Pettersson.

    3) Use after free in dccp_invalid_packet(), from Eric Dumazet.

    4) Signedness overflow in SO_{SND,RCV}BUFFORCE, also from Eric
    Dumazet.

    5) When tunneling between ipv4 and ipv6 we can be left with the wrong
    skb->protocol value as we enter the IPSEC engine and this causes all
    kinds of problems. Set it before the output path does any
    dst_output() calls, from Eli Cooper.

    6) bcmgenet uses wrong device struct pointer in DMA API calls, fix from
    Florian Fainelli.

    7) Various netfilter nat bug fixes from Florian Westphal.

    8) Fix memory leak in ipvlan_link_new(), from Gao Feng.

    9) Locking fixes, particularly wrt. socket lookups, in l2tp from
    Guillaume Nault.

    10) Avoid invoking rhash teardowns in atomic context by moving netlink
    cb->done() dump completion from a worker thread. Fix from Herbert
    Xu.

    11) Buffer refcount problems in tun and macvtap on errors, from Jason
    Wang.

    12) We don't set Kconfig symbol DEFAULT_TCP_CONG properly when the user
    selects BBR. Fix from Julian Wollrath.

    13) Fix deadlock in transmit path on altera TSE driver, from Lino
    Sanfilippo.

    14) Fix unbalanced reference counting in dsa_switch_tree, from Nikita
    Yushchenko.

    15) tc_tunnel_key needs to be properly exported to userspace via uapi,
    fix from Roi Dayan.

    16) rds_tcp_init_net() doesn't unregister notifier in error path, fix
    from Sowmini Varadhan.

    17) Stale packet header pointer access after pskb_expand_head() in
    geneve driver, fix from Sabrina Dubroca.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
    net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
    geneve: avoid use-after-free of skb->data
    tipc: check minimum bearer MTU
    net: renesas: ravb: unintialized return value
    sh_eth: remove unchecked interrupts for RZ/A1
    net: bcmgenet: Utilize correct struct device for all DMA operations
    NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
    cdc_ether: Fix handling connection notification
    ip6_offload: check segs for NULL in ipv6_gso_segment.
    RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
    Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
    ipv6: Set skb->protocol properly for local output
    ipv4: Set skb->protocol properly for local output
    packet: fix race condition in packet_set_ring
    net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
    net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
    net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks
    net: ethernet: stmmac: platform: fix outdated function header
    net: ethernet: stmmac: dwmac-meson8b: fix probe error path
    net: ethernet: stmmac: dwmac-generic: fix probe error path
    ...

    Linus Torvalds
     

02 Dec, 2016

2 commits

  • While debugging the rtmutex unlock vs. dequeue race Will suggested using
    READ_ONCE() in rt_mutex_owner() as it might race against the
    cmpxchg_release() in unlock_rt_mutex_safe().

    Will: "It's a minor thing which will most likely not matter in practice"

    Careful search did not unearth an actual problem in today's code, but it's
    better to be safe than surprised.

    Suggested-by: Will Deacon
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0            CPU1            CPU2

    l->owner=T1
                    rt_mutex_lock(l)
                    lock(l->wait_lock)
                    l->owner = T1 | HAS_WAITERS;
                    enqueue(T2)
                    boost()
                      unlock(l->wait_lock)
                    schedule()

                                    rt_mutex_lock(l)
                                    lock(l->wait_lock)
                                    l->owner = T1 | HAS_WAITERS;
                                    enqueue(T3)
                                    boost()
                                      unlock(l->wait_lock)
                                    schedule()
                    signal(->T2)    signal(->T3)
                    lock(l->wait_lock)
                    dequeue(T2)
                    deboost()
                      unlock(l->wait_lock)
                                    lock(l->wait_lock)
                                    dequeue(T3)
                                      ===> wait list is now empty
                                    deboost()
                                      unlock(l->wait_lock)
                    lock(l->wait_lock)
                    fixup_rt_mutex_waiters()
                      if (wait_list_empty(l)) {
                        owner = l->owner & ~HAS_WAITERS;
                        l->owner = owner
                          ==> l->owner = T1
                      }

                                    lock(l->wait_lock)
    rt_mutex_unlock(l)              fixup_rt_mutex_waiters()
                                      if (wait_list_empty(l)) {
                                        owner = l->owner & ~HAS_WAITERS;
    cmpxchg(l->owner, T1, NULL)
      ===> Success (l->owner = NULL)
                                        l->owner = owner
                                          ==> l->owner = T1
                                      }

    That means the problem is caused by fixup_rt_mutex_waiters(), which does the
    RMW to clear the waiters bit unconditionally when there are no waiters in
    the rtmutex's rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation, then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after the
    successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.
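
The effect of the check can be sketched by replaying the final steps of the interleaving in a fixed order on a single thread. This is a simplified model under stated assumptions, not the kernel code: `model_replay()`, `MODEL_T1` and `MODEL_HAS_WAITERS` are hypothetical names, and the cmpxchg() is modelled as a plain compare-and-store because nothing runs concurrently here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_HAS_WAITERS ((uintptr_t)1)      /* assumed: low bit of owner */
#define MODEL_T1          ((uintptr_t)0x1000) /* stand-in for task T1 */

/*
 * Replay, in order:
 *   1. the fixup reads l->owner (waiters bit already clear, list empty)
 *   2. the owner's fast-path unlock does cmpxchg(l->owner, T1, NULL)
 *   3. the fixup performs the write half of its read-modify-write
 * With check_bit set, step 3 only writes if step 1 saw the waiters bit.
 * Returns the final owner word.
 */
static uintptr_t model_replay(bool check_bit)
{
	uintptr_t lock_owner = MODEL_T1;
	uintptr_t seen;

	/* step 1: read the owner word */
	seen = lock_owner;

	/* step 2: fast-path unlock succeeds because the bit is clear */
	if (lock_owner == MODEL_T1)
		lock_owner = 0;
	assert(lock_owner == 0);

	/* step 3: write back a value derived from the stale read */
	if (!check_bit || (seen & MODEL_HAS_WAITERS))
		lock_owner = seen & ~MODEL_HAS_WAITERS;

	return lock_owner;
}
```

Without the check, the replay ends with T1 recorded as owner of a lock it has already released; with the check, the stale write is skipped and the owner word stays NULL.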

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but it refuses to reproduce on x86-64, while the
    problem exists there as well. That refusal might explain why this was not
    discovered earlier despite the bug existing from day one of the rtmutex
    implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information which allowed us to decode this subtle problem.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

01 Dec, 2016

1 commit

  • If we have a branch that looks something like this

    int foo = map->value;
    if (condition) {
        foo += blah;
    } else {
        foo = bar;
    }
    map->array[foo] = baz;

    We will incorrectly assume that the !condition branch is equal to the condition
    branch as the register for foo will be UNKNOWN_VALUE in both cases. We need to
    adjust this logic to only do this if we didn't do a varlen access after we
    processed the !condition branch, otherwise we have different ranges and need to
    check the other branch as well.

    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Reported-by: Jann Horn
    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Josef Bacik
     

30 Nov, 2016

1 commit

  • This enables CONFIG_MODVERSIONS again, but allows for missing symbol CRC
    information in order to work around the issue that newer binutils
    versions seem to occasionally drop the CRC on the floor. binutils 2.26
    seems to work fine, while binutils 2.27 seems to break MODVERSIONS of
    symbols that have been defined in assembler files.

    [ We've had random missing CRC's before - it may be an old problem that
    just is now reliably triggered with the weak asm symbols and a new
    version of binutils ]

    Some day I really do want to remove MODVERSIONS entirely. Sadly, today
    does not appear to be that day: Debian people apparently do want the
    option to enable MODVERSIONS to make it easier to have external modules
    across kernel versions, and this seems to be a fairly minimal fix for
    the annoying problem.

    Cc: Ben Hutchings
    Acked-by: Michal Marek
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Nov, 2016

2 commits

  • Michael Kerrisk reported:

    > Regarding the previous paragraph... My tests indicate
    > that writing *any* value to the autogroup [nice priority level]
    > file causes the task group to get a lower priority.

    Because autogroup didn't call the then meaningless scale_load()...

    Autogroup nice level adjustment has been broken ever since load
    resolution was increased for 64-bit kernels. Use scale_load() to
    scale group weight.
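
A small sketch of the arithmetic behind the bug, under stated assumptions: the extra load resolution is taken to be a 10-bit left shift and the nice-0 weight to be 1024. The `MODEL_` constants and both helper functions are hypothetical names for illustration, not the scheduler's own definitions.

```c
#include <assert.h>

#define MODEL_FIXEDPOINT_SHIFT 10     /* assumed extra resolution, 64-bit */
#define MODEL_NICE_0_WEIGHT    1024UL /* assumed weight for nice level 0 */

/* Scale a table weight up to the higher-resolution representation. */
static unsigned long model_scale_load(unsigned long w)
{
	return w << MODEL_FIXEDPOINT_SHIFT;
}

/*
 * How many times lighter a group with the given weight is than a group
 * whose nice-0 weight was scaled correctly.
 */
static unsigned long model_share_ratio(unsigned long group_weight)
{
	assert(group_weight != 0);
	return model_scale_load(MODEL_NICE_0_WEIGHT) / group_weight;
}
```

Under these assumptions, a group whose nice-0 weight is stored unscaled ends up 1024 times lighter than its correctly scaled peers, which matches the observation that writing a value lowered the group's priority.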

    Michael Kerrisk tested this patch to fix the problem:

    > Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
    > Test setup:
    >
    > Terminal window 1: running 40 CPU burner jobs
    > Terminal window 2: running 40 CPU burner jobs
    > Terminal window 3: running 1 CPU burner job
    >
    > Demonstrated that:
    > * Writing "0" to the autogroup file for TW1 now causes no change
    > to the rate at which the processes on the terminal consume CPU.
    > * Writing -20 to the autogroup file for TW1 caused those processes
    > to get the lion's share of CPU while TW2 and TW3 get a tiny amount.
    > * Writing -20 to the autogroup files for TW1 and TW3 allowed the
    > process on TW3 to get as much CPU as it was getting when
    > the autogroup nice values for both terminals were 0.

    Reported-by: Michael Kerrisk
    Tested-by: Michael Kerrisk
    Signed-off-by: Mike Galbraith
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-man
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Pull perf fixes from Ingo Molnar:
    "Six fixes for bugs that were found via fuzzing, and a trivial
    hw-enablement patch for AMD Family-17h CPU PMUs"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Allow only a single PMU/box within an events group
    perf/x86/intel: Cure bogus unwind from PEBS entries
    perf/x86: Restore TASK_SIZE check on frame pointer
    perf/core: Fix address filter parser
    perf/x86: Add perf support for AMD family-17h processors
    perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
    perf/core: Do not set cpuctx->cgrp for unscheduled cgroups

    Linus Torvalds
     

22 Nov, 2016

4 commits

  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
    it, but see the next patch.

    However the comment is correct in that autogroup_move_group() must always
    change task_group() for every thread so the sysctl_ check is very wrong;
    we can race with cgroups and even sys_setsid() is not safe because a task
    running with task_group() == ag->tg must participate in refcounting:

    #include <assert.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

        assert(sctl > 0);
        if (fork()) {
            wait(NULL); // destroy the child's ag/tg
            pause();
        }

        assert(pwrite(sctl, "1\n", 2, 0) == 2);
        assert(setsid() > 0);
        if (fork())
            pause();

        kill(getppid(), SIGKILL);
        sleep(1);

        // The child has gone, the grandchild runs with kref == 1
        assert(pwrite(sctl, "0\n", 2, 0) == 2);
        assert(setsid() > 0);

        // runs with the freed ag/tg
        for (;;)
            sleep(1);

        return 0;
    }

    crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
    autogroup_move_group() actually frees the task_group or this happens later.

    Reported-by: Vern Lovejoy
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Pull sparc fixes from David Miller:

    1) With modern networking cards we can run out of 32-bit DMA space, so
    support 64-bit DMA addressing when possible on sparc64. From Dave
    Tushar.

    2) Some signal frame validation checks are inverted on sparc32, fix
    from Andreas Larsson.

    3) Lockdep tables can get too large in some circumstances on sparc64,
    add a way to adjust the size a bit. From Babu Moger.

    4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc: drop duplicate header scatterlist.h
    lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
    config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
    sunbmac: Fix compiler warning
    sunqe: Fix compiler warnings
    sparc64: Enable 64-bit DMA
    sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
    sparc64: Bind PCIe devices to use IOMMU v2 service
    sparc64: Initialize iommu_map_table and iommu_pool
    sparc64: Add ATU (new IOMMU) support
    sparc64: Add FORCE_MAX_ZONEORDER and default to 13
    sparc64: fix compile warning section mismatch in find_node()
    sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
    sparc64: Fix find_node warning if numa node cannot be found

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Clear congestion control state when changing algorithms on an
    existing socket, from Florian Westphal.

    2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
    from Jia Jie Ho.

    3) Fix PTP handling in stmmac driver for GMAC4, from Giuseppe
    CAVALLARO.

    4) Fix udplite multicast delivery handling, it ignores the udp_table
    parameter passed into the lookups, from Pablo Neira Ayuso.

    5) Synchronize the space estimated by rtnl_vfinfo_size and the space
    actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

    6) Fix memory leak in fib_info when splitting nodes, from Alexander
    Duyck.

    7) If a driver does a napi_hash_del() explicitly and not via
    netif_napi_del(), it must perform RCU synchronization as needed. Fix
    this in virtio-net and bnxt drivers, from Eric Dumazet.

    8) Likewise, it is not necessary to invoke napi_hash_del() if we are
    also doing netif_napi_del() in the same code path. Remove such calls
    from be2net and cxgb4 drivers, also from Eric Dumazet.

    9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
    from WANG Cong.

    10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

    11) We cannot cache routes in ip6_tunnel when using inherited traffic
    classes, from Paolo Abeni.

    12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

    13) Splice operations cannot use freezable blocking calls in AF_UNIX,
    from WANG Cong.

    14) Link dump filtering by master device and kind support added an error
    in loop index updates during the dump if we actually do filter, fix
    from Zhang Shengju.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
    tcp: zero ca_priv area when switching cc algorithms
    net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
    ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
    tipc: eliminate obsolete socket locking policy description
    rtnl: fix the loop index update error in rtnl_dump_ifinfo()
    l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
    net: macb: add check for dma mapping error in start_xmit()
    rtnetlink: fix FDB size computation
    netns: fix get_net_ns_by_fd(int pid) typo
    af_unix: conditionally use freezable blocking calls in read
    net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
    net: ethernet: ti: cpsw: add missing sanity check
    net: ethernet: ti: cpsw: fix secondary-emac probe error path
    net: ethernet: ti: cpsw: fix of_node and phydev leaks
    net: ethernet: ti: cpsw: fix deferred probe
    net: ethernet: ti: cpsw: fix mdio device reference leak
    net: ethernet: ti: cpsw: fix bad register access in probe error path
    net: sky2: Fix shutdown crash
    cfg80211: limit scan results cache size
    net sched filters: pass netlink message flags in event notification
    ...

    Linus Torvalds
     

21 Nov, 2016

1 commit

  • The token table passed into match_token() must be null-terminated, which
    it currently is not in perf's address filter string parser, as caught
    by Vince's perf_fuzzer and KASAN.

    It doesn't blow up otherwise because of the alignment padding of the table
    to the next element in the .rodata, which is luck.

    Fix this by adding a null-terminator to the token table.
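
The role of the terminator can be shown with a minimal user-space sketch of a sentinel-terminated token table. It is a simplified model, not the kernel's parser: `model_token`, `model_tokens`, `model_match()` and the pattern strings are hypothetical, and plain `strcmp()` stands in for pattern matching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token table entry: an id and the string it matches. */
struct model_token {
	int id;
	const char *pattern;
};

enum { MODEL_TOK_FILTER = 1, MODEL_TOK_START = 2, MODEL_TOK_ERR = -1 };

static const struct model_token model_tokens[] = {
	{ MODEL_TOK_FILTER, "filter" },
	{ MODEL_TOK_START,  "start"  },
	/* Terminator: the walk below stops only when it sees this entry. */
	{ MODEL_TOK_ERR,    NULL     },
};

/*
 * Walk the table until a pattern matches or the NULL-pattern terminator
 * is reached. There is no length argument, so a table without the
 * terminator would be read past its end on any unmatched input.
 */
static int model_match(const struct model_token *table, const char *s)
{
	const struct model_token *t;

	assert(table != NULL && s != NULL);
	for (t = table; t->pattern != NULL; t++) {
		if (strcmp(t->pattern, s) == 0)
			return t->id;
	}
	return t->id; /* id carried by the terminator entry */
}
```

In this model an unmatched string yields the id stored in the terminator entry, so the last row serves both as the end marker and as the error value.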

    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Alexander Shishkin
    Acked-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: stable@vger.kernel.org # v4.7+
    Fixes: 375637bc524 ("perf/core: Introduce address range filtering")
    Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

19 Nov, 2016

1 commit


17 Nov, 2016

1 commit

  • I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
    invalid accesses to bpf map entries. Fix this up by doing a few things:

    1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
    life and just adds extra complexity.

    2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
    minimum value to 0 for positive AND's.

    3) Don't do operations on the ranges if they are set to the limits, as they are
    by definition undefined, and allowing arithmetic operations on those values
    could make them appear valid when they really aren't.

    This fixes the testcase provided by Jann as well as a few other theoretical
    problems.
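
The AND rule in point 2 can be sketched as a small range helper. This is an illustration only, not the verifier's code: `model_range` and `model_and_bounds()` are hypothetical names, and the upper bound of `mask` is an arithmetic fact added here for the example rather than a claim about what the verifier tracks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical closed interval [min, max] for a register value. */
struct model_range {
	int64_t min;
	int64_t max;
};

/*
 * Bound the result of (x & mask) for an unknown x.
 * Returns -1 for a negative mask, where the sketch refuses to derive
 * bounds. Otherwise returns 0 and sets the minimum to 0: a non-negative
 * mask clears the sign bit, so the result is never negative and never
 * exceeds the mask.
 */
static int model_and_bounds(int64_t mask, struct model_range *out)
{
	assert(out != NULL);
	if (mask < 0)
		return -1;
	out->min = 0;
	out->max = mask;
	return 0;
}
```

For a mask of 0xff, for example, every input lands in the range 0 to 255, including negative inputs.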

    Reported-by: Jann Horn
    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Josef Bacik
     

16 Nov, 2016

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Alexei discovered a race condition in modules failing to load that can
    cause a ftrace check to trigger and disable ftrace.

    This is because of the way modules are registered to ftrace. Their
    functions are loaded in the ftrace function tables but set to
    "disabled" since they are still in the process of being loaded by the
    module. After the module is finished, it calls back into the ftrace
    infrastructure to enable it.

    Looking deeper into the locations that access all the functions in the
    table, I found more locations that should ignore the disabled ones"

    * tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
    ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records

    Linus Torvalds
     

15 Nov, 2016

6 commits

  • Commit:

    db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")

    failed to verify that event->cgrp is actually the scheduled cgroup
    in a CPU before setting cpuctx->cgrp. This patch fixes that.

    Now that there is a different path for scheduled and unscheduled
    cgroup, add a warning to catch when cpuctx->cgrp is still set after
    the last cgroup event has been unscheduled.

    To verify the bug:

    # Create 2 cgroups.
    mkdir /dev/cgroups/devices/g1
    mkdir /dev/cgroups/devices/g2

    # launch a task, bind it to a cpu and move it to g1
    CPU=2
    while :; do : ; done &
    P=$!

    taskset -pc $CPU $P
    echo $P > /dev/cgroups/devices/g1/tasks

    # monitor g2 (it runs no tasks) and observe output
    perf stat -e cycles -I 1000 -C $CPU -G g2

    #       time          counts unit events
     1.000091408       7,579,527      cycles    g2
     2.000350111   <not counted>      cycles    g2
     3.000589181   <not counted>      cycles    g2
     4.000771428   <not counted>      cycles    g2

    # note the first line, which shows a task running in g2 despite
    # g2 having no tasks. This is because cpuctx->cgrp was wrongly
    # set when the context of the new event was installed.
    # After applying the fix we obtain the right output:

    perf stat -e cycles -I 1000 -C $CPU -G g2
    #       time          counts unit events
     1.000119615   <not counted>      cycles    g2
     2.000389430   <not counted>      cycles    g2
     3.000590962   <not counted>      cycles    g2

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Nilay Vaish
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Pull networking fixes from David Miller:

    1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
    from Alexander Duyck.

    2) Fix lockdep splats in iwlwifi, from Johannes Berg.

    3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
    is disabled, from Liping Zhang.

    4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.

    5) Disable UFO when path will apply IPSEC transformations, from Jakub
    Sitnicki.

    6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.

    7) skb_checksum_help() should never actually use the value "0" for the
    resulting checksum, that has a special meaning, use CSUM_MANGLED_0
    instead. From Eric Dumazet.

    8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
    Yuval Mintz.

    9) Fix SCTP reference counting of associations and transports in
    sctp_diag. From Xin Long.

    10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
    a previous layer or similar, so explicitly clear the ipv6 control
    block in the skb. From Eli Cooper.

    11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
    Cong.

    12) Correct device ID of T6 adapter in cxgb4 driver, from Hariprasad
    Shenai.

    13) Fix potential access past the end of the skb page frag array in
    tcp_sendmsg(). From Eric Dumazet.

    14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
    from David Ahern.

    15) Don't return an error in tcp_sendmsg() if we wrote any bytes
    successfully, from Eric Dumazet.

    16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
    but forgot to purge these unlock calls. From Eric Dumazet.

    17) Fix memory leak in error path of __genl_register_family(). We leak
    the attrbuf, from WANG Cong.

    18) cgroupstats netlink policy table is mis-sized, from WANG Cong.

    19) Several XDP bug fixes in mlx5, from Saeed Mahameed.

    20) Fix several device refcount leaks in network drivers, from Johan
    Hovold.

    21) icmp6_send() should use skb dst device not skb->dev to determine L3
    routing domain. From David Ahern.

    22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.

    23) We leak new macvlan port in some cases of macvlan_common_newlink()
    errors. Fix from Gao Feng.

    24) Similar to the icmp6_send() fix, icmp_route_lookup() should
    determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
    Also from David Ahern.

    25) Several fixes for route offloading and FIB notification handling in
    mlxsw driver, from Jiri Pirko.

    26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.

    27) Fix long standing regression in ipv4 redirect handling, wrt.
    validating the new neighbour's reachability. From Stephen Suryaputra
    Lin.

    28) If sk_filter() trims the packet excessively, handle it reasonably in
    tcp input instead of exploding. From Eric Dumazet.

    29) Fix handling of napi hash state when copying channels in sfc driver,
    from Bert Kenward.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
    mlxsw: spectrum_router: Flush FIB tables during fini
    net: stmmac: Fix lack of link transition for fixed PHYs
    sctp: change sk state only when it has assocs in sctp_shutdown
    bnx2: Wait for in-flight DMA to complete at probe stage
    Revert "bnx2: Reset device during driver initialization"
    ps3_gelic: fix spelling mistake in debug message
    net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
    ibmvnic: Fix size of debugfs name buffer
    ibmvnic: Unmap ibmvnic_statistics structure
    sfc: clear napi_hash state when copying channels
    mlxsw: spectrum_router: Correctly dump neighbour activity
    mlxsw: spectrum: Fix refcount bug on span entries
    bnxt_en: Fix VF virtual link state.
    bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
    Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
    tcp: take care of truncations done by sk_filter()
    ipv4: use new_gw for redirect neigh lookup
    r8152: Fix error path in open function
    net: bpqether.h: remove if_ether.h guard
    net: __skb_flow_dissect() must cap its return value
    ...

    Linus Torvalds
     
  • When a module is first loaded and its function ip records are added to the
    ftrace list of functions to modify, they are set to DISABLED, as their text
    is still in a read only state. When the module is fully loaded, and can be
    updated, the flag is cleared, and if there are any functions that should be
    tracing them, it is updated at that moment.

    But there are several locations that do record accounting and should ignore
    records that are marked as disabled, or they can cause issues.

    Alexei already fixed one location, but others need to be addressed.

    Cc: stable@vger.kernel.org
    Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions"
    Reported-by: Alexei Starovoitov
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • ftrace_shutdown() checks for sanity of ftrace records
    and if dyn_ftrace->flags is not zero, it will warn.
    It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
    since some module was loaded, but before ftrace_module_enable()
    cleared the flags for this module.

    In other words the module.c is doing:
    ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
    ... // here ftrace_shutdown() is called that warns, since
    err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED

    Fix it by ignoring disabled records.
    It's similar to what __ftrace_hash_rec_update() is already doing.
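    A minimal user-space sketch of the idea, not the kernel code: a sanity
    pass over records skips any record still marked disabled instead of
    treating its flags as an error. The struct, the flag values and the
    function name below are hypothetical stand-ins.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in the kernel's ftrace code. */
#define REC_FL_ENABLED  (1UL << 0)
#define REC_FL_DISABLED (1UL << 1)

struct rec {
    unsigned long ip;    /* address of the traced function */
    unsigned long flags; /* state bits for this record */
};

/*
 * Count records whose flags are unexpectedly non-zero.
 * Records still marked disabled belong to a module that has not
 * finished loading, so they are skipped rather than reported.
 */
size_t count_unexpected(const struct rec *recs, size_t n)
{
    size_t bad = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].flags & REC_FL_DISABLED)
            continue;
        if (recs[i].flags != 0)
            bad++;
    }
    return bad;
}
```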

    Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com

    Cc: stable@vger.kernel.org
    Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions"
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Steven Rostedt

    Alexei Starovoitov
     
  • This reverts commit bfd8d3f23b51018388be0411ccbc2d56277fe294.

    It turns out that this flushes things much too aggressively, and causes
    lines to break up when the system logger races with new continuation
    lines being printed.

    There's a pending patch to make printk() flushing much more
    straightforward, but it's too invasive for 4.9, so in the meantime let's
    just not make the system message logging flush continuation lines.
    They'll be flushed by the final newline anyway.

    Suggested-by: Petr Mladek
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull irq fix from Ingo Molnar:
    "This fixes a genirq regression that resulted in the Intel/Broxton
    pinctrl/GPIO driver (and possibly others) spewing warnings"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Use irq type from irqdata instead of irqdesc

    Linus Torvalds
     

12 Nov, 2016

3 commits

  • Pull power management fixes from Rafael Wysocki:
    "These fix two bugs in error code paths in the PM core (system-wide
    suspend of devices), a device reference leak in the boot-time suspend
    test code and a cpupower utility regression from the 4.7 cycle.

    Specifics:

    - Prevent the PM core from attempting to suspend parent devices if
    any of their children, whose suspend callbacks were invoked
    asynchronously, have failed to suspend during the "late" and
    "noirq" phases of system-wide suspend of devices (Brian Norris).

    - Prevent the boot-time system suspend test code from leaking a
    reference to the RTC device used by it (Johan Hovold).

    - Fix cpupower to use the return value of one of its library
    functions correctly and restore the correct behavior of it when
    used for setting cpufreq tunables broken during the 4.7 development
    cycle (Laura Abbott)"

    * tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
    PM / sleep: fix device reference leak in test_suspend
    cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set

    Linus Torvalds
     
  • * pm-tools-fixes:
    cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set

    * pm-sleep-fixes:
    PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
    PM / sleep: fix device reference leak in test_suspend

    Rafael J. Wysocki
     
  • This reverts commit 05fd007e4629 ("console: don't prefer first
    registered if DT specifies stdout-path").

    The reverted commit changes existing behavior on which many ARM boards
    rely. Many ARM small-board-computers, e.g. the Raspberry Pi, have
    both a video output and a serial console. Depending on whether the user
    is using the device as a more regular computer; or as a headless device
    we need to have the console on either one or the other.

    Many users rely on the kernel behavior of the console being present on
    both outputs, before the reverted commit the console setup with no
    console= kernel arguments on an ARM board which sets stdout-path in dt
    would look like this:

    [root@localhost ~]# cat /proc/consoles
    ttyS0 -W- (EC p a) 4:64
    tty0 -WU (E p ) 4:1

    Whereas after the reverted commit, it looks like this:

    [root@localhost ~]# cat /proc/consoles
    ttyS0 -W- (EC p a) 4:64

    This commit reverts commit 05fd007e4629 ("console: don't prefer first
    registered if DT specifies stdout-path") restoring the original
    behavior.

    Fixes: 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path")
    Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
    Signed-off-by: Hans de Goede
    Cc: Paul Burton
    Cc: Rob Herring
    Cc: Frank Rowand
    Cc: Thorsten Leemhuis
    Cc: Greg Kroah-Hartman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans de Goede
     

08 Nov, 2016

3 commits

  • The type flags in the irq descriptor are there for historical reasons and
    only updated via irq_modify_status() or irq_set_type(). Both functions also
    update the type flags in irqdata. __setup_irq() is the only left over user
    of the type flags in the irq descriptor.

    If __setup_irq() is called with empty irq type flags, then the type flags
    are retrieved from irqdata. If an interrupt is shared, then the type flags
    are compared with the type flags stored in the irq descriptor.

    On x86 the ioapic does not have a irq_set_type() callback because the type
    is defined in the BIOS tables and cannot be changed. The type is stored in
    irqdata at setup time without updating the type data in the irq
    descriptor. As a result the comparison described above fails.

    There is no point in updating the irq descriptor flags because the only
    relevant storage is irqdata. Use the type flags from irqdata for both
    retrieval and comparison in __setup_irq() instead.

    Aside from that, the printout in case of non-matching type flags has the old
    and new type flags arguments flipped. Fix that as well.

    For correctness' sake the flags stored in the irq descriptor should be
    removed, but this is beyond the scope of this bugfix and will be done in a
    later patch.

    Fixes: 4b357daed698 ("genirq: Look-up trigger type if not specified by caller")
    Reported-and-tested-by: Mika Westerberg
    Signed-off-by: Thomas Gleixner
    Cc: Marc Zyngier
    Cc: Jon Hunter
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • In map_create(), we first find and create the map, then once that
    succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
    a new anon fd through anon_inode_getfd(). The problem is, once the
    latter fails, e.g. due to the RLIMIT_NOFILE limit, then we only destruct
    the map via map->ops->map_free(), but without uncharging the previously
    locked memory first. That means that the user_struct allocation is
    leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
    Make the label names in the fix consistent with bpf_prog_load().
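    The error-unwinding pattern described above can be sketched in user
    space as follows. This is an illustration under assumptions, not the
    kernel's map_create(): the structs, the helper names, the fd_result
    parameter (which simulates the outcome of fd allocation) and the label
    names are all hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the per-user accounting and the map object. */
struct user_acct {
    long locked_pages; /* pages currently charged to the user */
    long limit;        /* stand-in for RLIMIT_MEMLOCK, in pages */
};

struct map {
    long pages;
    struct user_acct *user;
};

static int charge_memlock(struct map *m, struct user_acct *u)
{
    if (u->locked_pages + m->pages > u->limit)
        return -1;
    u->locked_pages += m->pages;
    m->user = u;
    return 0;
}

static void uncharge_memlock(struct map *m)
{
    m->user->locked_pages -= m->pages;
    m->user = NULL;
}

/*
 * fd_result simulates file descriptor allocation: a negative value is
 * a failure, anything else is the new fd. On success *out owns the map.
 */
int create_map(struct user_acct *u, long pages, int fd_result, struct map **out)
{
    int err;
    struct map *m = calloc(1, sizeof(*m));

    if (!m)
        return -1;
    m->pages = pages;

    err = charge_memlock(m, u);
    if (err)
        goto free_map_nouncharge; /* nothing was charged yet */

    err = fd_result;
    if (err < 0)
        goto free_map; /* undo the charge before freeing */

    *out = m;
    return err;

free_map:
    uncharge_memlock(m);
free_map_nouncharge:
    free(m);
    return err;
}
```

    Each label undoes exactly the steps that had completed before the
    failure, so a late failure cannot leave the charge behind.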

    Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
    added an extra per-cpu reserve to the hash table map to restore old
    behaviour from pre prealloc times. When non-prealloc is in use for a
    map, then the problem is that once a hash table extra element has been
    linked into the hash-table, and the hash table is destroyed due to
    refcount dropping to zero, then htab_map_free() -> delete_all_elements()
    will walk the whole hash table and drop all elements via htab_elem_free().
    The problem is that the element from the extra reserve is first fed
    to the wrong backend allocator and eventually freed twice.

    Fixes: a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Nov, 2016

1 commit

  • ….kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull stack vmap fixups from Thomas Gleixner:
    "Two small patches related to sched_show_task():

    - make sure to hold a reference on the task stack while accessing it

    - remove the thread_saved_pc printout

    .. and add a sanity check into release_task_stack() to catch problems
    with task stack references"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Remove pointless printout in sched_show_task()
    sched/core: Fix oops in sched_show_task()

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    fork: Add task stack refcounting sanity check and prevent premature task stack freeing

    Linus Torvalds
     

04 Nov, 2016

1 commit

  • cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
    taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
    but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
    CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
    so we could end up accessing out of bounds.

    Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
    this is safe because the rest are initialized to 0's.
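    A small sketch of why sizing the array by the larger maximum is safe,
    assuming made-up constants and a made-up policy struct (none of these
    names are the kernel's): every index the family may validate against
    is in bounds, and the entries beyond the small command's own
    attributes are zero-initialized.

```c
/* Hypothetical values; only their ordering (small < large) matters here. */
enum { SMALL_ATTR_MAX = 1, LARGE_ATTR_MAX = 4 };

struct policy {
    int type; /* 0 means "no policy for this attribute" */
};

/*
 * Sized by the larger maximum, which is what the family's maxattr uses.
 * Only the small command's attribute is set; the rest default to zero.
 */
static const struct policy small_cmd_policy[LARGE_ATTR_MAX + 1] = {
    [SMALL_ATTR_MAX] = { .type = 7 },
};

/* Look up an attribute the way a validator bounded by maxattr would. */
int lookup_type(int attr)
{
    if (attr < 0 || attr > LARGE_ATTR_MAX)
        return -1; /* rejected by the maxattr check */
    return small_cmd_policy[attr].type; /* always within the array */
}
```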

    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

03 Nov, 2016

2 commits

  • In sched_show_task() we print out a useless hex number, not even a
    symbol, and there's a big question mark whether this even makes sense
    anyway. I suspect we should just remove it all.

    Signed-off-by: Linus Torvalds
    Acked-by: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: brgerst@gmail.com
    Cc: jann@thejh.net
    Cc: keescook@chromium.org
    Cc: linux-api@vger.kernel.org
    Cc: tycho.andersen@canonical.com
    Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     
  • When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
    remains in the task list after its stack pointer was already set to NULL.

    Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
    will trigger a NULL pointer dereference if an attempt to dump such a thread's
    traces (e.g. SysRq-t, khungtaskd) is made.

    Since show_stack() in sched_show_task() calls try_get_task_stack() and
    sched_show_task() is called from interrupt context, calling
    try_get_task_stack() from sched_show_task() will be safe as well.

    Signed-off-by: Tetsuo Handa
    Acked-by: Andy Lutomirski
    Acked-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: brgerst@gmail.com
    Cc: jann@thejh.net
    Cc: keescook@chromium.org
    Cc: linux-api@vger.kernel.org
    Cc: tycho.andersen@canonical.com
    Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
    Signed-off-by: Ingo Molnar

    Tetsuo Handa
     

02 Nov, 2016

1 commit


01 Nov, 2016

1 commit

  • If something goes wrong with task stack refcounting and a stack
    refcount hits zero too early, warn and leak it rather than
    potentially freeing it early (and silently).
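    A user-space sketch of the "warn and leak" policy. It assumes that
    "too early" can be detected with a flag saying whether the task has
    finished; that flag, the struct and the function names are
    hypothetical, not the kernel's.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical task model: a refcounted stack plus a "finished" flag. */
struct task {
    int stack_refs;
    bool dead;   /* assumption: set once the task can no longer run */
    void *stack;
};

/* Free the stack, unless the task is still alive: then warn and leak it. */
static bool release_stack(struct task *t)
{
    if (!t->dead) {
        fprintf(stderr, "warning: stack refcount hit zero too early\n");
        return false; /* a leak is safer than a premature free */
    }
    free(t->stack);
    t->stack = NULL;
    return true;
}

/* Drop one reference; returns true only if the stack was actually freed. */
bool put_stack(struct task *t)
{
    if (--t->stack_refs > 0)
        return false;
    return release_stack(t);
}
```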

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

29 Oct, 2016

1 commit

  • Pull power management fixes from Rafael Wysocki:
    "These fix two intel_pstate issues related to the way it works when the
    scaling_governor sysfs attribute is set to "performance" and fix up
    messages in the system suspend core code.

    Specifics:

    - Fix a missing KERN_CONT in a system suspend message by converting
    the affected code to using pr_info() and pr_cont() instead of the
    "raw" printk() (Jon Hunter).

    - Make intel_pstate set the CPU P-state from its .set_policy()
    callback when the scaling_governor sysfs attribute is set to
    "performance" so that it interacts with NOHZ_FULL more predictably
    which was the case before 4.7 (Rafael Wysocki).

    - Make intel_pstate always request the maximum allowed P-state when
    the scaling_governor sysfs attribute is set to "performance" to
    prevent it from effectively ignoring that setting in some
    situations (Rafael Wysocki)"

    * tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: intel_pstate: Always set max P-state in performance mode
    PM / suspend: Fix missing KERN_CONT for suspend message
    cpufreq: intel_pstate: Set P-state upfront in performance mode

    Linus Torvalds