30 Nov, 2016

1 commit

  • This enables CONFIG_MODVERSIONS again, but allows for missing symbol CRC
    information in order to work around the issue that newer binutils
    versions seem to occasionally drop the CRC on the floor. binutils 2.26
    seems to work fine, while binutils 2.27 seems to break MODVERSIONS of
    symbols that have been defined in assembler files.

    [ We've had random missing CRC's before - it may be an old problem that
    just is now reliably triggered with the weak asm symbols and a new
    version of binutils ]

    Some day I really do want to remove MODVERSIONS entirely. Sadly, today
    does not appear to be that day: Debian people apparently do want the
    option to enable MODVERSIONS to make it easier to have external modules
    across kernel versions, and this seems to be a fairly minimal fix for
    the annoying problem.

    Cc: Ben Hutchings
    Acked-by: Michal Marek
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Nov, 2016

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Six fixes for bugs that were found via fuzzing, and a trivial
    hw-enablement patch for AMD Family-17h CPU PMUs"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Allow only a single PMU/box within an events group
    perf/x86/intel: Cure bogus unwind from PEBS entries
    perf/x86: Restore TASK_SIZE check on frame pointer
    perf/core: Fix address filter parser
    perf/x86: Add perf support for AMD family-17h processors
    perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
    perf/core: Do not set cpuctx->cgrp for unscheduled cgroups

    Linus Torvalds
     

22 Nov, 2016

4 commits

  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
    it, but see the next patch.

    However the comment is correct in that autogroup_move_group() must always
    change task_group() for every thread so the sysctl_ check is very wrong;
    we can race with cgroups and even sys_setsid() is not safe because a task
    running with task_group() == ag->tg must participate in refcounting:

    #include <assert.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

        assert(sctl > 0);
        if (fork()) {
            wait(NULL); // destroy the child's ag/tg
            pause();
        }

        assert(pwrite(sctl, "1\n", 2, 0) == 2);
        assert(setsid() > 0);
        if (fork())
            pause();

        kill(getppid(), SIGKILL);
        sleep(1);

        // The child has gone, the grandchild runs with kref == 1
        assert(pwrite(sctl, "0\n", 2, 0) == 2);
        assert(setsid() > 0);

        // runs with the freed ag/tg
        for (;;)
            sleep(1);

        return 0;
    }

    crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
    autogroup_move_group() actually frees the task_group or this happens later.

    Reported-by: Vern Lovejoy
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Pull sparc fixes from David Miller:

    1) With modern networking cards we can run out of 32-bit DMA space, so
    support 64-bit DMA addressing when possible on sparc64. From Dave
    Tushar.

    2) Some signal frame validation checks are inverted on sparc32, fix
    from Andreas Larsson.

    3) Lockdep tables can get too large in some circumstances on sparc64,
    add a way to adjust the size a bit. From Babu Moger.

    4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc: drop duplicate header scatterlist.h
    lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
    config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
    sunbmac: Fix compiler warning
    sunqe: Fix compiler warnings
    sparc64: Enable 64-bit DMA
    sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
    sparc64: Bind PCIe devices to use IOMMU v2 service
    sparc64: Initialize iommu_map_table and iommu_pool
    sparc64: Add ATU (new IOMMU) support
    sparc64: Add FORCE_MAX_ZONEORDER and default to 13
    sparc64: fix compile warning section mismatch in find_node()
    sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
    sparc64: Fix find_node warning if numa node cannot be found

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Clear congestion control state when changing algorithms on an
    existing socket, from Florian Westphal.

    2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
    from Jia Jie Ho.

    3) Fix PTP handling in stmmac driver for GMAC4, from Giuseppe
    CAVALLARO.

    4) Fix udplite multicast delivery handling, it ignores the udp_table
    parameter passed into the lookups, from Pablo Neira Ayuso.

    5) Synchronize the space estimated by rtnl_vfinfo_size and the space
    actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

    6) Fix memory leak in fib_info when splitting nodes, from Alexander
    Duyck.

    7) If a driver does a napi_hash_del() explicitly and not via
    netif_napi_del(), it must perform RCU synchronization as needed. Fix
    this in virtio-net and bnxt drivers, from Eric Dumazet.

    8) Likewise, it is not necessary to invoke napi_hash_del() if we are
    also doing netif_napi_del() in the same code path. Remove such calls
    from be2net and cxgb4 drivers, also from Eric Dumazet.

    9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
    from WANG Cong.

    10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

    11) We cannot cache routes in ip6_tunnel when using inherited traffic
    classes, from Paolo Abeni.

    12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

    13) Splice operations cannot use freezable blocking calls in AF_UNIX,
    from WANG Cong.

    14) Link dump filtering by master device and kind support added an error
    in loop index updates during the dump if we actually do filter, fix
    from Zhang Shengju.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
    tcp: zero ca_priv area when switching cc algorithms
    net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
    ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
    tipc: eliminate obsolete socket locking policy description
    rtnl: fix the loop index update error in rtnl_dump_ifinfo()
    l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
    net: macb: add check for dma mapping error in start_xmit()
    rtnetlink: fix FDB size computation
    netns: fix get_net_ns_by_fd(int pid) typo
    af_unix: conditionally use freezable blocking calls in read
    net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
    net: ethernet: ti: cpsw: add missing sanity check
    net: ethernet: ti: cpsw: fix secondary-emac probe error path
    net: ethernet: ti: cpsw: fix of_node and phydev leaks
    net: ethernet: ti: cpsw: fix deferred probe
    net: ethernet: ti: cpsw: fix mdio device reference leak
    net: ethernet: ti: cpsw: fix bad register access in probe error path
    net: sky2: Fix shutdown crash
    cfg80211: limit scan results cache size
    net sched filters: pass netlink message flags in event notification
    ...

    Linus Torvalds
     

21 Nov, 2016

1 commit

  • The token table passed into match_token() must be null-terminated, which
    it currently is not in the perf's address filter string parser, as caught
    by Vince's perf_fuzzer and KASAN.

    It doesn't blow up otherwise because of the alignment padding of the table
    to the next element in the .rodata, which is luck.

    Fixing by adding a null-terminator to the token table.
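
    The role of the terminator can be illustrated with a small userspace
    sketch. This is not the kernel's match_token(), which also handles
    format specifiers; it is a simplified exact-match lookup, and the token
    names, struct token_entry and lookup_token() are hypothetical. The loop
    stops only when it sees the NULL pattern, so a table without that last
    entry would be read past its end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token IDs, for illustration only. */
enum { TOK_ERR = -1, TOK_FILTER, TOK_START, TOK_STOP };

struct token_entry {
    int token;
    const char *pattern;    /* a NULL pattern marks the end of the table */
};

static const struct token_entry filter_tokens[] = {
    { TOK_FILTER, "filter" },
    { TOK_START,  "start"  },
    { TOK_STOP,   "stop"   },
    { TOK_ERR,    NULL     },   /* sentinel: the lookup loop relies on it */
};

/* Return the token of the first exact match, or the sentinel's token. */
static int lookup_token(const struct token_entry *table, const char *s)
{
    const struct token_entry *p;

    assert(table != NULL && s != NULL);
    for (p = table; p->pattern != NULL; p++) {
        if (strcmp(p->pattern, s) == 0)
            return p->token;
    }
    return p->token;    /* p now points at the sentinel entry */
}
```

    With the sentinel in place, an unknown string such as "bogus" falls
    through to the last entry and yields TOK_ERR instead of reading whatever
    happens to follow the table in memory.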

    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Alexander Shishkin
    Acked-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: stable@vger.kernel.org # v4.7+
    Fixes: 375637bc524 ("perf/core: Introduce address range filtering")
    Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

19 Nov, 2016

1 commit


17 Nov, 2016

1 commit

    I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
    invalid accesses to bpf map entries. Fix this up by doing a few things:

    1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
    life and just adds extra complexity.

    2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
    minimum value to 0 for positive AND's.

    3) Don't do operations on the ranges if they are set to the limits, as they are
    by definition undefined, and allowing arithmetic operations on those values
    could make them appear valid when they really aren't.

    This fixes the testcase provided by Jann as well as a few other theoretical
    problems.
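
    Points 2) and 3) can be sketched as a conservative range rule. This is
    not the verifier's code: struct value_range, its "known" flag and
    range_and() are hypothetical names, and the sketch only shows the
    arithmetic. For non-negative operands, x & mask can clear every bit
    (so the minimum is 0) and can never exceed either operand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct value_range {
    int64_t min;
    int64_t max;
    bool known;     /* false: range is at the limits, do no arithmetic on it */
};

/*
 * Conservative range for (x & mask) where x lies in r.
 * The result is marked unknown if the mask is negative, or if the
 * input range is unknown or may contain negative values.
 */
static struct value_range range_and(struct value_range r, int64_t mask)
{
    struct value_range out = { 0, 0, false };

    if (!r.known || mask < 0 || r.min < 0)
        return out;

    assert(r.min <= r.max);
    out.min = 0;                            /* AND may clear every bit */
    out.max = r.max < mask ? r.max : mask;  /* never above either operand */
    out.known = true;
    return out;
}
```

    For example, a value known to be in [4, 10] ANDed with 7 gives [0, 7],
    while an AND with -1 leaves the result unknown rather than pretending
    the old bounds still hold.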

    Reported-by: Jann Horn
    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Josef Bacik
     

16 Nov, 2016

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Alexei discovered a race condition in modules failing to load that can
    cause a ftrace check to trigger and disable ftrace.

    This is because of the way modules are registered to ftrace. Their
    functions are loaded in the ftrace function tables but set to
    "disabled" since they are still in the process of being loaded by the
    module. After the module is finished, it calls back into the ftrace
    infrastructure to enable it.

    Looking deeper into the locations that access all the functions in the
    table, I found more locations that should ignore the disabled ones"

    * tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
    ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records

    Linus Torvalds
     

15 Nov, 2016

6 commits

  • Commit:

    db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")

    failed to verify that event->cgrp is actually the scheduled cgroup
    in a CPU before setting cpuctx->cgrp. This patch fixes that.

    Now that there is a different path for scheduled and unscheduled
    cgroup, add a warning to catch when cpuctx->cgrp is still set after
    the last cgroup event has been unscheduled.

    To verify the bug:

    # Create 2 cgroups.
    mkdir /dev/cgroups/devices/g1
    mkdir /dev/cgroups/devices/g2

    # launch a task, bind it to a cpu and move it to g1
    CPU=2
    while :; do : ; done &
    P=$!

    taskset -pc $CPU $P
    echo $P > /dev/cgroups/devices/g1/tasks

    # monitor g2 (it runs no tasks) and observe output
    perf stat -e cycles -I 1000 -C $CPU -G g2

    # time counts unit events
    1.000091408 7,579,527 cycles g2
    2.000350111 <not counted> cycles g2
    3.000589181 <not counted> cycles g2
    4.000771428 <not counted> cycles g2

    # note first line that displays that a task run in g2, despite
    # g2 having no tasks. This is because cpuctx->cgrp was wrongly
    # set when context of new event was installed.
    # After applying the fix we obtain the right output:

    perf stat -e cycles -I 1000 -C $CPU -G g2
    # time counts unit events
    1.000119615 <not counted> cycles g2
    2.000389430 <not counted> cycles g2
    3.000590962 <not counted> cycles g2

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Nilay Vaish
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Pull networking fixes from David Miller:

    1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
    from Alexander Duyck.

    2) Fix lockdep splats in iwlwifi, from Johannes Berg.

    3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
    is disabled, from Liping Zhang.

    4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.

    5) Disable UFO when path will apply IPSEC transformations, from Jakub
    Sitnicki.

    6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.

    7) skb_checksum_help() should never actually use the value "0" for the
    resulting checksum, that has a special meaning, use CSUM_MANGLED_0
    instead. From Eric Dumazet.

    8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
    Yuval Mintz.

    9) Fix SCTP reference counting of associations and transports in
    sctp_diag. From Xin Long.

    10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
    a previous layer or similar, so explicitly clear the ipv6 control
    block in the skb. From Eli Cooper.

    11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
    Cong.

    12) Correct device ID of T6 adapter in cxgb4 driver, from Hariprasad
    Shenai.

    13) Fix potential access past the end of the skb page frag array in
    tcp_sendmsg(). From Eric Dumazet.

    14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
    from David Ahern.

    15) Don't return an error in tcp_sendmsg() if we wrote any bytes
    successfully, from Eric Dumazet.

    16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
    but forgot to purge these unlock calls. From Eric Dumazet.

    17) Fix memory leak in error path of __genl_register_family(). We leak
    the attrbuf, from WANG Cong.

    18) cgroupstats netlink policy table is mis-sized, from WANG Cong.

    19) Several XDP bug fixes in mlx5, from Saeed Mahameed.

    20) Fix several device refcount leaks in network drivers, from Johan
    Hovold.

    21) icmp6_send() should use skb dst device not skb->dev to determine L3
    routing domain. From David Ahern.

    22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.

    23) We leak new macvlan port in some cases of macvlan_common_newlink()
    errors. Fix from Gao Feng.

    24) Similar to the icmp6_send() fix, icmp_route_lookup() should
    determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
    Also from David Ahern.

    25) Several fixes for route offloading and FIB notification handling in
    mlxsw driver, from Jiri Pirko.

    26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.

    27) Fix long standing regression in ipv4 redirect handling, wrt.
    validating the new neighbour's reachability. From Stephen Suryaputra
    Lin.

    28) If sk_filter() trims the packet excessively, handle it reasonably in
    tcp input instead of exploding. From Eric Dumazet.

    29) Fix handling of napi hash state when copying channels in sfc driver,
    from Bert Kenward.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
    mlxsw: spectrum_router: Flush FIB tables during fini
    net: stmmac: Fix lack of link transition for fixed PHYs
    sctp: change sk state only when it has assocs in sctp_shutdown
    bnx2: Wait for in-flight DMA to complete at probe stage
    Revert "bnx2: Reset device during driver initialization"
    ps3_gelic: fix spelling mistake in debug message
    net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
    ibmvnic: Fix size of debugfs name buffer
    ibmvnic: Unmap ibmvnic_statistics structure
    sfc: clear napi_hash state when copying channels
    mlxsw: spectrum_router: Correctly dump neighbour activity
    mlxsw: spectrum: Fix refcount bug on span entries
    bnxt_en: Fix VF virtual link state.
    bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
    Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
    tcp: take care of truncations done by sk_filter()
    ipv4: use new_gw for redirect neigh lookup
    r8152: Fix error path in open function
    net: bpqether.h: remove if_ether.h guard
    net: __skb_flow_dissect() must cap its return value
    ...

    Linus Torvalds
     
  • When a module is first loaded and its function ip records are added to the
    ftrace list of functions to modify, they are set to DISABLED, as their text
    is still in a read only state. When the module is fully loaded, and can be
    updated, the flag is cleared, and if there are any functions that should be
    tracing them, they are updated at that moment.

    But there are several locations that do record accounting and should ignore
    records that are marked as disabled, or they can cause issues.

    Alexei already fixed one location, but others need to be addressed.

    Cc: stable@vger.kernel.org
    Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions"
    Reported-by: Alexei Starovoitov
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • ftrace_shutdown() checks for sanity of ftrace records
    and if dyn_ftrace->flags is not zero, it will warn.
    It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
    since some module was loaded, but before ftrace_module_enable()
    cleared the flags for this module.

    In other words the module.c is doing:
    ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
    ... // here ftrace_shutdown() is called that warns, since
    err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED

    Fix it by ignoring disabled records.
    It's similar to what __ftrace_hash_rec_update() is already doing.
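
    The shape of such a check can be sketched in userspace as follows. The
    flag values, struct record and count_stale() are hypothetical and do not
    mirror the real dyn_ftrace layout; the sketch only shows a sanity walk
    that skips records still marked as disabled instead of treating their
    non-zero flags as an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, for illustration only. */
#define REC_FL_DISABLED (1UL << 0)
#define REC_FL_ENABLED  (1UL << 1)

struct record {
    unsigned long flags;
};

/*
 * Count records whose flags are unexpectedly non-zero at shutdown.
 * Records still marked disabled belong to a module that is in the
 * middle of loading, so they are skipped rather than reported.
 */
static size_t count_stale(const struct record *recs, size_t n)
{
    size_t i, stale = 0;

    assert(recs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (recs[i].flags & REC_FL_DISABLED)
            continue;
        if (recs[i].flags != 0)
            stale++;
    }
    return stale;
}
```

    Without the first test inside the loop, a record that is merely waiting
    for its module to finish loading would be counted as stale.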

    Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com

    Cc: stable@vger.kernel.org
    Fixes: b7ffffbb46f2 "ftrace: Add infrastructure for delayed enabling of module functions"
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Steven Rostedt

    Alexei Starovoitov
     
  • This reverts commit bfd8d3f23b51018388be0411ccbc2d56277fe294.

    It turns out that this flushes things much too aggressively, and causes
    lines to break up when the system logger races with new continuation
    lines being printed.

    There's a pending patch to make printk() flushing much more
    straightforward, but it's too invasive for 4.9, so in the meantime let's
    just not make the system message logging flush continuation lines.
    They'll be flushed by the final newline anyway.

    Suggested-by: Petr Mladek
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull irq fix from Ingo Molnar:
    "This fixes a genirq regression that resulted in the Intel/Broxton
    pinctrl/GPIO driver (and possibly others) spewing warnings"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Use irq type from irqdata instead of irqdesc

    Linus Torvalds
     

12 Nov, 2016

3 commits

  • Pull power management fixes from Rafael Wysocki:
    "These fix two bugs in error code paths in the PM core (system-wide
    suspend of devices), a device reference leak in the boot-time suspend
    test code and a cpupower utility regression from the 4.7 cycle.

    Specifics:

    - Prevent the PM core from attempting to suspend parent devices if
    any of their children, whose suspend callbacks were invoked
    asynchronously, have failed to suspend during the "late" and
    "noirq" phases of system-wide suspend of devices (Brian Norris).

    - Prevent the boot-time system suspend test code from leaking a
    reference to the RTC device used by it (Johan Hovold).

    - Fix cpupower to use the return value of one of its library
    functions correctly and restore the correct behavior of it when
    used for setting cpufreq tunables broken during the 4.7 development
    cycle (Laura Abbott)"

    * tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
    PM / sleep: fix device reference leak in test_suspend
    cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set

    Linus Torvalds
     
  • * pm-tools-fixes:
    cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set

    * pm-sleep-fixes:
    PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
    PM / sleep: fix device reference leak in test_suspend

    Rafael J. Wysocki
     
  • This reverts commit 05fd007e4629 ("console: don't prefer first
    registered if DT specifies stdout-path").

    The reverted commit changes existing behavior on which many ARM boards
    rely. Many ARM small-board computers, such as the Raspberry Pi, have
    both a video output and a serial console. Depending on whether the user
    is using the device as a more regular computer or as a headless device,
    we need to have the console on either one or the other.

    Many users rely on the kernel behavior of the console being present on
    both outputs. Before the reverted commit, the console setup with no
    console= kernel arguments on an ARM board which sets stdout-path in dt
    would look like this:

    [root@localhost ~]# cat /proc/consoles
    ttyS0 -W- (EC p a) 4:64
    tty0 -WU (E p ) 4:1

    Whereas after the reverted commit, it looks like this:

    [root@localhost ~]# cat /proc/consoles
    ttyS0 -W- (EC p a) 4:64

    This commit reverts commit 05fd007e4629 ("console: don't prefer first
    registered if DT specifies stdout-path") restoring the original
    behavior.

    Fixes: 05fd007e4629 ("console: don't prefer first registered if DT specifies stdout-path")
    Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
    Signed-off-by: Hans de Goede
    Cc: Paul Burton
    Cc: Rob Herring
    Cc: Frank Rowand
    Cc: Thorsten Leemhuis
    Cc: Greg Kroah-Hartman
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans de Goede
     

08 Nov, 2016

3 commits

  • The type flags in the irq descriptor are there for historical reasons and
    only updated via irq_modify_status() or irq_set_type(). Both functions also
    update the type flags in irqdata. __setup_irq() is the only left over user
    of the type flags in the irq descriptor.

    If __setup_irq() is called with empty irq type flags, then the type flags
    are retrieved from irqdata. If an interrupt is shared, then the type flags
    are compared with the type flags stored in the irq descriptor.

    On x86 the ioapic does not have a irq_set_type() callback because the type
    is defined in the BIOS tables and cannot be changed. The type is stored in
    irqdata at setup time without updating the type data in the irq
    descriptor. As a result the comparison described above fails.

    There is no point in updating the irq descriptor flags because the only
    relevant storage is irqdata. Use the type flags from irqdata for both
    retrieval and comparison in __setup_irq() instead.

    Aside from that, the printout in case of non-matching type flags has the old
    and new type flags arguments flipped. Fix that as well.

    For correctness' sake the flags stored in the irq descriptor should be
    removed, but this is beyond the scope of this bugfix and will be done in a
    later patch.

    Fixes: 4b357daed698 ("genirq: Look-up trigger type if not specified by caller")
    Reported-and-tested-by: Mika Westerberg
    Signed-off-by: Thomas Gleixner
    Cc: Marc Zyngier
    Cc: Jon Hunter
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • In map_create(), we first find and create the map, then once that
    succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
    a new anon fd through anon_inode_getfd(). The problem is, once the
    latter fails, e.g. due to the RLIMIT_NOFILE limit, then we only destruct
    the map via map->ops->map_free(), but without uncharging the previously
    locked memory first. That means that the user_struct allocation is
    leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
    Make the label names in the fix consistent with bpf_prog_load().
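
    The ordering problem can be sketched with a userspace analogue. Every
    name here (struct account, charge(), uncharge(), create_object() and the
    labels) is hypothetical and is not taken from the kernel source; the
    get_fd_fails parameter merely simulates the final step failing. The
    point is that a failure after the charge must jump to a label that
    uncharges before the object is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the user accounting and the map. */
struct account { long locked; long limit; };
struct object  { long size; };

static int charge(struct account *a, long size)
{
    if (a->locked + size > a->limit)
        return -1;
    a->locked += size;
    return 0;
}

static void uncharge(struct account *a, long size)
{
    assert(a->locked >= size);
    a->locked -= size;
}

/* get_fd_fails simulates the last step (fd allocation) failing. */
static struct object *create_object(struct account *a, long size,
                                    int get_fd_fails)
{
    struct object *obj = malloc(sizeof(*obj));

    if (!obj)
        return NULL;
    obj->size = size;

    if (charge(a, size) != 0)
        goto free_obj_nouncharge;   /* nothing was charged yet */
    if (get_fd_fails)
        goto free_obj;              /* charged: must be undone */
    return obj;

free_obj:
    uncharge(a, size);
free_obj_nouncharge:
    free(obj);
    return NULL;
}
```

    Using two labels keeps each failure point paired with exactly the
    cleanup it needs: the early failure skips the uncharge, the late one
    falls through both steps.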

    Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
    added an extra per-cpu reserve to the hash table map to restore old
    behaviour from pre prealloc times. When non-prealloc is in use for a
    map, then problem is that once a hash table extra element has been
    linked into the hash-table, and the hash table is destroyed due to
    refcount dropping to zero, then htab_map_free() -> delete_all_elements()
    will walk the whole hash table and drop all elements via htab_elem_free().
    The problem is that the element from the extra reserve is first fed
    to the wrong backend allocator and eventually freed twice.

    Fixes: a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Nov, 2016

1 commit

  • ….kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull stack vmap fixups from Thomas Gleixner:
    "Two small patches related to sched_show_task():

    - make sure to hold a reference on the task stack while accessing it

    - remove the thread_saved_pc printout

    .. and add a sanity check into release_task_stack() to catch problems
    with task stack references"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Remove pointless printout in sched_show_task()
    sched/core: Fix oops in sched_show_task()

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    fork: Add task stack refcounting sanity check and prevent premature task stack freeing

    Linus Torvalds
     

04 Nov, 2016

1 commit

  • cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
    taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
    but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
    CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
    so we could end up accessing out of bounds.

    Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
    this is safe because the rest are initialized to 0's.
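
    A userspace sketch of the sizing rule, with hypothetical names
    (SMALL_ATTR_MAX, LARGE_ATTR_MAX, struct policy, count_typed()) standing
    in for the real netlink types: a validator that is told maxattr will
    index the policy array up to and including maxattr, so the array has to
    be at least maxattr + 1 entries long even if only a few are filled in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute maxima mirroring the situation described above. */
enum { SMALL_ATTR_MAX = 1, LARGE_ATTR_MAX = 4 };

struct policy {
    int type;   /* 0 means "no policy specified" */
};

/* Sized by the larger maximum; entries not listed are zero-initialized. */
static const struct policy small_policy[LARGE_ATTR_MAX + 1] = {
    [SMALL_ATTR_MAX] = { .type = 7 },
};

/* A validator that trusts maxattr and reads policy[0..maxattr]. */
static int count_typed(const struct policy *policy, int maxattr)
{
    int i, n = 0;

    assert(policy != NULL && maxattr >= 0);
    for (i = 0; i <= maxattr; i++) {
        if (policy[i].type != 0)
            n++;
    }
    return n;
}
```

    Had small_policy been declared with SMALL_ATTR_MAX + 1 entries, the call
    count_typed(small_policy, LARGE_ATTR_MAX) would read past the end of the
    array.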

    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

03 Nov, 2016

2 commits

  • In sched_show_task() we print out a useless hex number, not even a
    symbol, and there's a big question mark whether this even makes sense
    anyway, I suspect we should just remove it all.

    Signed-off-by: Linus Torvalds
    Acked-by: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: brgerst@gmail.com
    Cc: jann@thejh.net
    Cc: keescook@chromium.org
    Cc: linux-api@vger.kernel.org
    Cc: tycho.andersen@canonical.com
    Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     
  • When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
    remains in the task list after its stack pointer was already set to NULL.

    Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
    will trigger a NULL pointer dereference if an attempt to dump such a thread's
    traces (e.g. SysRq-t, khungtaskd) is made.

    Since show_stack() in sched_show_task() calls try_get_task_stack() and
    sched_show_task() is called from interrupt context, calling
    try_get_task_stack() from sched_show_task() will be safe as well.

    Signed-off-by: Tetsuo Handa
    Acked-by: Andy Lutomirski
    Acked-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: brgerst@gmail.com
    Cc: jann@thejh.net
    Cc: keescook@chromium.org
    Cc: linux-api@vger.kernel.org
    Cc: tycho.andersen@canonical.com
    Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
    Signed-off-by: Ingo Molnar

    Tetsuo Handa
     

02 Nov, 2016

1 commit


01 Nov, 2016

1 commit

  • If something goes wrong with task stack refcounting and a stack
    refcount hits zero too early, warn and leak it rather than
    potentially freeing it early (and silently).
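
    The "warn and leak" policy can be sketched in userspace like this. The
    struct owner, its dead flag and release_stack() are hypothetical and are
    not the kernel's data structures; the sketch assumes the release path
    can tell whether the owner is really finished, and refuses to free when
    it is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct owner {
    bool dead;      /* set once the owner can no longer use its stack */
    void *stack;
};

/* Returns true if the stack was freed, false if it was deliberately leaked. */
static bool release_stack(struct owner *o)
{
    assert(o != NULL);
    if (!o->dead) {
        /* Freeing now could be a use-after-free; a leak is the safer bug. */
        fprintf(stderr, "warning: stack released while owner is live, leaking it\n");
        return false;
    }
    free(o->stack);
    o->stack = NULL;
    return true;
}
```

    A leak costs memory but is visible through the warning, whereas an early
    free would corrupt whatever reuses that memory without any diagnostic.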

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

29 Oct, 2016

4 commits

  • Pull power management fixes from Rafael Wysocki:
    "These fix two intel_pstate issues related to the way it works when the
    scaling_governor sysfs attribute is set to "performance" and fix up
    messages in the system suspend core code.

    Specifics:

    - Fix a missing KERN_CONT in a system suspend message by converting
    the affected code to using pr_info() and pr_cont() instead of the
    "raw" printk() (Jon Hunter).

    - Make intel_pstate set the CPU P-state from its .set_policy()
    callback when the scaling_governor sysfs attribute is set to
    "performance" so that it interacts with NOHZ_FULL more predictably
    which was the case before 4.7 (Rafael Wysocki).

    - Make intel_pstate always request the maximum allowed P-state when
    the scaling_governor sysfs attribute is set to "performance" to
    prevent it from effectively ignoring that setting in some
    situations (Rafael Wysocki)"

    * tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: intel_pstate: Always set max P-state in performance mode
    PM / suspend: Fix missing KERN_CONT for suspend message
    cpufreq: intel_pstate: Set P-state upfront in performance mode

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc kernel fixes: a virtualization environment related fix, an uncore
    PMU driver removal handling fix, a PowerPC fix and new events for
    Knights Landing"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
    perf/powerpc: Don't call perf_event_disable() from atomic context
    perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
    perf/x86/intel/cstate: Add C-state residency events for Knights Landing

    Linus Torvalds
     
  • Pull timer fixes from Ingo Molnar:
    "Fix four timer locking races: two were noticed by Linus while
    reviewing the code while chasing a corruption bug, and two
    from fixing spurious USB timeouts"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timers: Prevent base clock corruption when forwarding
    timers: Prevent base clock rewind when forwarding clock
    timers: Lock base for same bucket optimization
    timers: Plug locking race vs. timer migration

    Linus Torvalds
     
  • …-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull objtool, irq and scheduler fixes from Ingo Molnar:
    "One more objtool fixlet for GCC6 code generation patterns, an irq
    DocBook fix and an unused variable warning fix in the scheduler"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Fix rare switch jump table pattern detection

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    doc: Add missing parameter for msi_setup

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Remove unused but set variable 'rq'

    Linus Torvalds
     

28 Oct, 2016

5 commits

  • The trinity syscall fuzzer triggered the following WARN() on powerpc:

    WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
    ...
    NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
    LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
    Call Trace:
    [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
    [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
    [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
    [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
    [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
    [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

    Followed by a lockdep warning:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.8.0-rc5+ #7 Tainted: G W
    -------------------------------
    ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by ls/2998:
    #0: (rcu_read_lock){......}, at: [] .__atomic_notifier_call_chain+0x0/0x1c0
    #1: (rcu_read_lock){......}, at: [] .hw_breakpoint_handler+0x0/0x2b0

    stack backtrace:
    CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
    Call Trace:
    [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
    [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
    [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
    [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
    [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
    [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
    [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
    [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
    [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
    [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
    [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
    [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

    While it looks like the first WARN() is probably valid, the other one is
    triggered by disabling the event via perf_event_disable() from atomic context.

    The event is disabled here in case we were not able to emulate
    the instruction that hit the breakpoint. By disabling the event
    we unschedule the event and make sure it's not scheduled back.

    But we can't call perf_event_disable() from atomic context; instead
    we need to use the event's pending_disable irq_work method to disable it.
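
    The deferral described above can be modelled in plain userspace C. This
    is only a sketch of the pattern (set a flag now, do the sleeping work
    later from a safe context), not the kernel code: struct event_model and
    both functions are hypothetical names, and the real path would queue an
    irq_work instead of relying on an explicit call.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a perf event; not the kernel's struct perf_event. */
struct event_model {
    bool enabled;
    bool pending_disable;
};

/*
 * Safe to call from "atomic" context in this model: it takes no lock and
 * cannot sleep, it only records that a disable was requested.
 */
static void model_disable_inatomic(struct event_model *e)
{
    e->pending_disable = true;
}

/*
 * Runs later from a context where the real disable would be allowed.
 * In the kernel this role is played by the event's irq_work callback.
 */
static void model_run_pending(struct event_model *e)
{
    if (e->pending_disable) {
        e->pending_disable = false;
        e->enabled = false;
    }
}
```

    In this model the event stays enabled until model_run_pending() runs,
    which is the price of not sleeping in the breakpoint handler.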

    Reported-by: Jan Stancek
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Huang Ying
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • …FIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic

    CAI Qian reported a crash in the PMU uncore device removal code,
    enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:

    https://marc.info/?l=linux-kernel&m=147688837328451

    The reason for the crash is that perf_pmu_unregister() tries to remove
    a PMU device which is not added at this point. We add PMU devices
    only after pmu_bus is registered, which happens in the
    perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.

    The fix is to get the 'pmu_bus_running' flag state at the point
    the PMU is taken out of the PMU list and remove the device
    later only if it's set.
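
    A minimal userspace model of that fix, under the assumption that device
    creation at registration time depends on the same flag. All names here
    (struct pmu_model, model_pmu_bus_running, the two functions) are
    hypothetical and only illustrate the "snapshot the flag, act on the
    snapshot" idea:

```c
#include <stdbool.h>

/* Models the 'pmu_bus_running' flag set once the bus is registered. */
static bool model_pmu_bus_running;

/* Hypothetical stand-in for a PMU; not the kernel's struct pmu. */
struct pmu_model {
    bool on_list;
    bool device_added;
    int device_del_calls;
};

static void model_pmu_register(struct pmu_model *pmu)
{
    pmu->on_list = true;
    /* A device is only added once the bus is running. */
    pmu->device_added = model_pmu_bus_running;
    pmu->device_del_calls = 0;
}

static void model_pmu_unregister(struct pmu_model *pmu)
{
    /* Snapshot the flag at the point the PMU leaves the list. */
    bool remove_device = model_pmu_bus_running;

    pmu->on_list = false;

    /* Later: tear down the device only if one could have been added. */
    if (remove_device) {
        pmu->device_del_calls++;
        pmu->device_added = false;
    }
}
```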

    Reported-by: CAI Qian <caiqian@redhat.com>
    Tested-by: CAI Qian <caiqian@redhat.com>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Kan Liang <kan.liang@intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rob Herring <robh@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Jiri Olsa
     
  • in_interrupt() returns a nonzero value when we are either in an
    interrupt or have bh disabled via local_bh_disable(). Since we are
    interested in only ignoring coverage from actual interrupts, do a proper
    check instead of just calling in_interrupt().

    As a result of this change, kcov will start to collect coverage from
    within local_bh_disable()/local_bh_enable() sections.
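
    The difference between the two checks can be shown with a userspace
    model of the preempt counter. The bit layout below (8 preempt bits,
    8 softirq bits, 4 hardirq bits, 1 NMI bit, with local_bh_disable()
    adding twice the softirq offset) is an assumption made for this sketch,
    and the M_ macros and model_ functions are hypothetical names, not
    kernel identifiers:

```c
#include <stdbool.h>

/* Assumed bit layout of the counter, for illustration only. */
#define M_PREEMPT_BITS  8
#define M_SOFTIRQ_BITS  8
#define M_HARDIRQ_BITS  4
#define M_NMI_BITS      1

#define M_SOFTIRQ_SHIFT M_PREEMPT_BITS
#define M_HARDIRQ_SHIFT (M_SOFTIRQ_SHIFT + M_SOFTIRQ_BITS)
#define M_NMI_SHIFT     (M_HARDIRQ_SHIFT + M_HARDIRQ_BITS)

#define M_SOFTIRQ_MASK  (((1UL << M_SOFTIRQ_BITS) - 1) << M_SOFTIRQ_SHIFT)
#define M_HARDIRQ_MASK  (((1UL << M_HARDIRQ_BITS) - 1) << M_HARDIRQ_SHIFT)
#define M_NMI_MASK      (((1UL << M_NMI_BITS) - 1) << M_NMI_SHIFT)

/* Serving a softirq adds one offset; disabling bh adds two. */
#define M_SOFTIRQ_OFFSET         (1UL << M_SOFTIRQ_SHIFT)
#define M_SOFTIRQ_DISABLE_OFFSET (2 * M_SOFTIRQ_OFFSET)
#define M_HARDIRQ_OFFSET         (1UL << M_HARDIRQ_SHIFT)

/* Broad check: nonzero for interrupts and for bh-disabled sections. */
static bool model_in_interrupt(unsigned long pc)
{
    return (pc & (M_HARDIRQ_MASK | M_SOFTIRQ_MASK | M_NMI_MASK)) != 0;
}

/* Narrow check: hardirq, NMI, or actually serving a softirq. */
static bool model_in_real_interrupt(unsigned long pc)
{
    return (pc & (M_HARDIRQ_MASK | M_SOFTIRQ_OFFSET | M_NMI_MASK)) != 0;
}
```

    With a counter value of M_SOFTIRQ_DISABLE_OFFSET (task context with bh
    disabled) the broad check reports true while the narrow one reports
    false, which is the case where coverage is now collected.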

    Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Nicolai Stange
    Cc: Andrey Ryabinin
    Cc: Kees Cook
    Cc: James Morse
    Cc: Vegard Nossum
    Cc: Quentin Casasnovas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes for this series, most notably the fix for the blk-mq
    software queue regression introduced in this merge window.

    Apart from that, a fix for an unlikely hang if a queue is flooded with
    FUA requests from Ming, and a few small fixes for nbd and badblocks.
    Lastly, a rename update for the proc softirq output, since the block
    polling code was made generic"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: update hardware and software queues for sleeping alloc
    block: flush: fix IO hang in case of flood fua req
    nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown
    badblocks: badblocks_set/clear update unacked_exist
    softirq: Display IRQ_POLL for irq-poll statistics

    Linus Torvalds
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.
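
    A rough sketch of the contention-bit idea using C11 atomics. This is
    not Peter Zijlstra's patch: the flag names, struct page_model and the
    functions are hypothetical, the waitqueue is reduced to a counter, and
    the re-check a real waiter needs after setting the bit is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define M_PG_LOCKED  (1u << 0)   /* the page lock itself */
#define M_PG_WAITERS (1u << 1)   /* set only when somebody is waiting */

/* Hypothetical stand-in for struct page. */
struct page_model {
    atomic_uint flags;
    unsigned int waitqueue_lookups;  /* counts slow-path visits */
};

/* Returns true if the lock was taken. */
static bool model_trylock_page(struct page_model *p)
{
    return !(atomic_fetch_or(&p->flags, M_PG_LOCKED) & M_PG_LOCKED);
}

/* A contending task announces itself before going to sleep. */
static void model_mark_waiter(struct page_model *p)
{
    atomic_fetch_or(&p->flags, M_PG_WAITERS);
}

/*
 * Drops the lock. Only when the waiters bit was set does the slow path
 * (waitqueue lookup and wakeup in a real implementation) run at all.
 * Returns true if the slow path was taken.
 */
static bool model_unlock_page(struct page_model *p)
{
    unsigned int old =
        atomic_fetch_and(&p->flags, ~(M_PG_LOCKED | M_PG_WAITERS));

    if (old & M_PG_WAITERS) {
        p->waitqueue_lookups++;
        return true;
    }
    return false;
}
```

    In the uncontended case model_unlock_page() is a single atomic
    operation and never touches the waitqueue state.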

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Oct, 2016

1 commit

  • Since commit:

    8663e24d56dc ("sched/fair: Reorder cgroup creation code")

    ... the variable 'rq' in alloc_fair_sched_group() is set but no longer used.
    Remove it to fix the following GCC warning when building with 'W=1':

    kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]

    Signed-off-by: Tobias Klauser
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch
    Signed-off-by: Ingo Molnar

    Tobias Klauser
     

25 Oct, 2016

2 commits

  • When a timer is enqueued we try to forward the timer base clock. This
    mechanism has two issues:

    1) Forwarding a remote base unlocked

    The forwarding function is called from get_target_base() with the current
    timer base lock held. But if the new target base is a different base than
    the current base (can happen with NOHZ, sigh!) then the forwarding is done
    on an unlocked base. This can lead to corruption of base->clk.

    Solution is simple: Invoke the forwarding after the target base is locked.

    2) Possible corruption due to jiffies advancing

    This is similar to the issue in get_next_timer_interrupt() which was fixed
    in the previous patch. jiffies can advance between check and assignment,
    thereby advancing base->clk beyond the next expiry value.

    So we need to read jiffies into a local variable once and do the checks and
    assignment with the local copy.

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Reported-by: Ashton Holmes
    Reported-by: Michael Thayer
    Signed-off-by: Thomas Gleixner
    Cc: Michal Necasek
    Cc: Peter Zijlstra
    Cc: knut.osmundsen@oracle.com
    Cc: stable@vger.kernel.org
    Cc: stern@rowland.harvard.edu
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Ashton and Michael reported that kernel versions 4.8 and later suffer from
    USB timeouts which are caused by the timer wheel rework.

    This is caused by a bug in the base clock forwarding mechanism, which leads
    to timers expiring early. The scenario which leads to this is:

    run_timers()
      while (jiffies >= base->clk) {
        collect_expired_timers();
        base->clk++;
        expire_timers();
      }

    So base->clk = jiffies + 1. Now the cpu goes idle:

    idle()
      get_next_timer_interrupt()
        nextevt = __next_timer_interrupt();
        if (time_after(nextevt, base->clk))
          base->clk = jiffies;

    jiffies has not advanced since run_timers(), so this assignment effectively
    decrements base->clk by one.

    base->clk is the index into the timer wheel arrays. So let's assume the
    following state after the base->clk increment in run_timers():

    jiffies = 0
    base->clk = 1

    A timer gets enqueued with an expiry delta of 63 ticks (which is the case
    with the USB timeout and HZ=250) so the resulting bucket index is:

    base->clk + delta = 1 + 63 = 64

    The timer goes into the first wheel level. The array size is 64 so it ends
    up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
    to index into bucket 0 again.

    If the cpu goes idle before jiffies advances, then the bug in the forwarding
    mechanism sets base->clk back to 0, so the next invocation of run_timers()
    at the next tick will index into bucket 0 and therefore expire the timer 62
    ticks too early.
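
    The arithmetic of this scenario can be replayed with a small userspace
    model. It is a sketch under simplifying assumptions: only the first
    wheel level with 64 buckets is modelled, counter wrap-around is
    ignored, and M_LVL0_SIZE, model_bucket_for() and
    model_first_expiry_jiffies() are hypothetical names.

```c
#define M_LVL0_SIZE 64u   /* buckets in the first wheel level, per the text */

/* Bucket a timer lands in when enqueued at base clock 'clk'. */
static unsigned int model_bucket_for(unsigned long clk, unsigned long delta)
{
    return (unsigned int)((clk + delta) % M_LVL0_SIZE);
}

/*
 * Replays run_timers() once per tick, starting at the tick after 'now',
 * and returns the jiffies value at which 'bucket' is first collected.
 * 'bucket' must be below M_LVL0_SIZE, otherwise the loop never ends.
 */
static unsigned long model_first_expiry_jiffies(unsigned long clk,
                                                unsigned long now,
                                                unsigned int bucket)
{
    for (unsigned long j = now + 1; ; j++) {
        while (j >= clk) {
            if (clk % M_LVL0_SIZE == bucket)
                return j;       /* bucket collected during this tick */
            clk++;
        }
    }
}
```

    With jiffies = 0 and base->clk = 1 the timer goes into bucket 0 and the
    model collects it at jiffies = 64. If base->clk is set back to 0, the
    same bucket is collected at jiffies = 1, which is 62 ticks before the
    requested expiry at jiffies = 63.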

    Instead of blindly setting base->clk to jiffies we must make the forwarding
    conditional on jiffies > base->clk, but we cannot use jiffies for this as
    we might run into the following issue:

    if (time_after(jiffies, base->clk)) {
      if (time_after(nextevt, base->clk))
        base->clk = jiffies;
    }

    jiffies can increment between the check and the assignment far enough to
    advance beyond nextevt. So we need to use a stable value for checking.

    get_next_timer_interrupt() has the basej argument which is the jiffies
    value snapshot taken in the calling code. So we can just use that.
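
    A sketch of forwarding against a stable snapshot, written as userspace
    C. struct timer_base_model, model_time_after() and
    model_forward_base_clk() are hypothetical names; clamping to nextevt
    when the snapshot is already past the next event is an assumption of
    this sketch and is not spelled out in the text above.

```c
/* Hypothetical stand-in for the timer base; only the clock is modelled. */
struct timer_base_model {
    unsigned long clk;
};

/* Wrap-safe "a is after b", in the style of the kernel's time_after(). */
static int model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Forward base->clk using 'basej', a jiffies snapshot taken once by the
 * caller. Every comparison and assignment uses that same value, so the
 * clock can neither move backwards nor jump past the next event.
 */
static void model_forward_base_clk(struct timer_base_model *base,
                                   unsigned long basej,
                                   unsigned long nextevt)
{
    if (model_time_after(basej, base->clk)) {
        if (model_time_after(nextevt, basej))
            base->clk = basej;
        else if (model_time_after(nextevt, base->clk))
            base->clk = nextevt;
    }
}
```

    For the scenario above (base->clk = 1, snapshot 0) the function leaves
    base->clk untouched, so the timer in bucket 0 is no longer reached
    early.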

    Thanks to Ashton for bisecting and providing trace data!

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Reported-by: Ashton Holmes
    Reported-by: Michael Thayer
    Signed-off-by: Thomas Gleixner
    Cc: Michal Necasek
    Cc: Peter Zijlstra
    Cc: knut.osmundsen@oracle.com
    Cc: stable@vger.kernel.org
    Cc: stern@rowland.harvard.edu
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner