29 Jan, 2018

1 commit


26 Jan, 2018

1 commit

  • Some dst_ops (e.g. md_dst_ops) don't set this handler, which may result in:
    "BUG: unable to handle kernel NULL pointer dereference at (null)"

    Let's add a helper to check if update_pmtu is available before calling it.

    Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
    Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
    CC: Roman Kapl
    CC: Xin Long
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
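
The guard described in this commit can be sketched in userspace C. This is a minimal illustration, not the kernel helper itself: `struct dst_ops` and `struct dst_entry` here are simplified stand-ins, and `record_pmtu`, `full_ops`, `bare_ops` and `dst_update_pmtu_checked` are hypothetical names chosen for the example.

```c
#include <stddef.h>

struct dst_entry;

/* Simplified stand-in for an ops table with an optional handler. */
struct dst_ops {
    void (*update_pmtu)(struct dst_entry *dst, unsigned int mtu);
};

struct dst_entry {
    const struct dst_ops *ops;
    unsigned int pmtu;
};

/* Example handler that simply records the new path MTU. */
void record_pmtu(struct dst_entry *dst, unsigned int mtu)
{
    dst->pmtu = mtu;
}

/* One ops table provides the handler, the other leaves it NULL. */
const struct dst_ops full_ops = { .update_pmtu = record_pmtu };
const struct dst_ops bare_ops = { .update_pmtu = NULL };

/*
 * Call update_pmtu only when the ops table provides it.
 * Returns 1 if the handler ran, 0 if it was skipped.
 */
int dst_update_pmtu_checked(struct dst_entry *dst, unsigned int mtu)
{
    if (dst && dst->ops && dst->ops->update_pmtu) {
        dst->ops->update_pmtu(dst, mtu);
        return 1;
    }
    return 0;
}
```

Calling the function pointer directly on an entry that uses `bare_ops` would dereference NULL; routing every call through the helper turns that case into a no-op.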
     

25 Jan, 2018

5 commits

  • When a tcp socket is closed, if it detects that its net namespace is
    exiting, close immediately and do not wait for FIN sequence.

    For normal sockets, a reference is taken to their net namespace, so it will
    never exit while the socket is open. However, kernel sockets do not take a
    reference to their net namespace, so it may begin exiting while the kernel
    socket is still open. In this case if the kernel socket is a tcp socket,
    it will stay open trying to complete its close sequence. The sock's dst(s)
    hold a reference to their interface, which are all transferred to the
    namespace's loopback interface when the real interfaces are taken down.
    When the namespace tries to take down its loopback interface, it hangs
    waiting for all references to the loopback interface to release, which
    results in messages like:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

    These messages continue until the socket finally times out and closes.
    Since the net namespace cleanup holds the net_mutex while calling its
    registered pernet callbacks, any new net namespace initialization is
    blocked until the current net namespace finishes exiting.

    After this change, the tcp socket notices the exiting net namespace, and
    closes immediately, releasing its dst(s) and their reference to the
    loopback interface, which lets the net namespace continue exiting.

    Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
    Signed-off-by: Dan Streetman
    Signed-off-by: David S. Miller

    Dan Streetman
     
  • Pull networking fixes from David Miller:

    1) Avoid negative netdev refcount in error flow of xfrm state add, from
    Aviad Yehezkel.

    2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the
    ethernet header protocol field in xfrm{4,6}_mode_tunnel_input().
    From Yossi Kuperman.

    3) Fix a syzbot triggered skb_under_panic in pppoe having to do with
    failing to allocate an appropriate amount of headroom. From
    Guillaume Nault.

    4) Fix memory leak in vmxnet3 driver, from Neil Horman.

    5) Cure out-of-bounds packet memory access in em_nbyte EMATCH module,
    from Wolfgang Bumiller.

    6) Restrict what kinds of sockets can be bound to the KCM multiplexer
    and also disallow when another layer has attached to the socket and
    made use of sk_user_data. From Tom Herbert.

    7) Fix use before init of IOTLB in vhost code, from Jason Wang.

    8) Correct STACR register write bit definition in IBM emac driver, from
    Ivan Mikhaylov.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    net/ibm/emac: wrong bit is used for STA control register write
    net/ibm/emac: add 8192 rx/tx fifo size
    vhost: do not try to access device IOTLB when not initialized
    vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
    i40e: flower: check if TC offload is enabled on a netdev
    qed: Free reserved MR tid
    qed: Remove reserveration of dpi for kernel
    kcm: Check if sk_user_data already set in kcm_attach
    kcm: Only allow TCP sockets to be attached to a KCM mux
    net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr
    net: sched: em_nbyte: don't add the data offset twice
    mlxsw: spectrum_router: Don't log an error on missing neighbor
    vmxnet3: repair memory leak
    ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
    pppoe: take ->needed_headroom of lower device into account on xmit
    xfrm: fix boolean assignment in xfrm_get_type_offload
    xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version
    xfrm: fix error flow in case of add state fails
    xfrm: Add SA to hardware at the end of xfrm_state_construct()

    Linus Torvalds
     
  • TCF_LAYER_LINK and TCF_LAYER_NETWORK returned the same pointer as
    skb->data points to the network header.
    Use skb_mac_header instead.

    Signed-off-by: Wolfgang Bumiller
    Signed-off-by: David S. Miller

    Wolfgang Bumiller
     
  • Pull tracing fixes from Steven Rostedt:
    "With the new ORC unwinder, ftrace stack tracing became dysfunctional.

    One was that ORC didn't know how to handle the ftrace callbacks in
    general (which Josh fixed).

    The other was that ORC would just bail if it hit a dynamically
    allocated trampoline. Which means all ftrace stack tracing that
    happens from the function tracer would produce no results (that
    includes killing the max stack size tracer). I added a check to the
    ORC unwinder to see if the trampoline belonged to ftrace, and if it
    did, use the orc entry of the static trampoline that was used to
    create the dynamic one (it would be identical).

    Finally, I noticed that the skip values of the stack tracing were out
    of whack. I went through and fixed them up"

    * tag 'trace-v4.15-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Update stack trace skipping for ORC unwinder
    ftrace, orc, x86: Handle ftrace dynamically allocated trampolines
    x86/ftrace: Fix ORC unwinding from ftrace handlers

    Linus Torvalds
     
  • This reverts commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12.

    Turns out distros do not want to make retpoline part of their "ABI",
    so this patch should not have been merged. Sorry Andi, this was my
    fault, I suggested it when your original patch was the "correct" way of
    doing this instead.

    Reported-by: Jiri Kosina
    Fixes: 6cfb521ac0d5 ("module: Add retpoline tag to VERMAGIC")
    Acked-by: Andi Kleen
    Cc: Thomas Gleixner
    Cc: David Woodhouse
    Cc: rusty@rustcorp.com.au
    Cc: arjan.van.de.ven@intel.com
    Cc: jeyu@kernel.org
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

24 Jan, 2018

3 commits

  • Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

    tg_set_cfs_bandwidth()
    get_online_cpus()
    cpus_read_lock()

    cfs_bandwidth_usage_inc()
    static_key_slow_inc()
    cpus_read_lock()

    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit 513674b5a2c9 ("net: reevalulate autoflowlabel setting after
    sysctl setting") removed the initialisation of
    ipv6_pinfo::autoflowlabel and added a second flag to indicate
    whether this field or the net namespace default should be used.

    The getsockopt() handling for this case was not updated, so it
    currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
    not explicitly enabled. Fix it to return the effective value, whether
    that has been set at the socket or net namespace level.

    Fixes: 513674b5a2c9 ("net: reevalulate autoflowlabel setting after sysctl ...")
    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • The function tracer can create a dynamically allocated trampoline that is
    called by the function mcount or fentry hook that is used to call the
    function callback that is registered. The problem is that the ORC unwinder
    will bail if it encounters one of these trampolines. This breaks the stack
    trace of function callbacks, which include the stack tracer and setting the
    stack trace for individual functions.

    Since these dynamic trampolines are basically copies of the static ftrace
    trampolines defined in ftrace_*.S, we do not need to create new orc entries
    for the dynamic trampolines. Finding the return address on the stack works
    the same way as for the functions that were copied to create the dynamic
    trampolines. When encountering a ftrace dynamic trampoline, we can just use
    the orc entry of the ftrace static function that was copied for that
    trampoline.

    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
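
The address translation idea in this commit can be sketched as follows. This is an assumption-laden illustration rather than the unwinder's code: `struct code_region`, `region_contains` and `unwind_lookup_addr` are hypothetical names, and the sketch only shows that an address inside a copied trampoline maps to the same offset inside the original.

```c
#include <stdbool.h>

/* Hypothetical description of a code region; not a kernel structure. */
struct code_region {
    unsigned long start;
    unsigned long size;
};

bool region_contains(const struct code_region *r, unsigned long ip)
{
    return ip >= r->start && ip - r->start < r->size;
}

/*
 * If ip falls inside the dynamic trampoline, return the address at the same
 * offset inside the static trampoline it was copied from, so the unwind
 * entry for the static code can be reused. Otherwise return ip unchanged.
 */
unsigned long unwind_lookup_addr(unsigned long ip,
                                 const struct code_region *dynamic_tramp,
                                 unsigned long static_tramp_start)
{
    if (region_contains(dynamic_tramp, ip))
        return static_tramp_start + (ip - dynamic_tramp->start);
    return ip;
}
```

Because the dynamic trampoline is a copy, no new unwind entries are needed for it; the lookup is redirected to the entries that already exist for the static code.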
     

22 Jan, 2018

1 commit

  • Tetsuo reported random crashes under memory pressure on a 32-bit x86
    system and tracked them down to the change that introduced
    page_vma_mapped_walk().

    The root cause of the issue is the faulty pointer math in check_pte().
    As ->pte may point to an arbitrary page, we have to check that it
    belongs to the section before doing math. Otherwise it may lead to weird
    results.

    It wasn't noticed until now as mem_map[] is virtually contiguous on
    flatmem or vmemmap sparsemem. Pointer arithmetic just works against all
    'struct page' pointers. But with classic sparsemem, it doesn't because
    each section memmap is allocated separately and so consecutive pfns
    crossing two sections might have struct pages at completely unrelated
    addresses.

    Let's restructure code a bit and replace pointer arithmetic with
    operations on pfns.

    Signed-off-by: Kirill A. Shutemov
    Reported-and-tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
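
The replacement of pointer arithmetic with pfn comparisons can be sketched like this. It is a minimal illustration under the assumption that a (possibly huge) page covers a contiguous pfn range; `pfn_in_range` is a hypothetical name and the code does not touch real `struct page` objects.

```c
#include <stdbool.h>

/*
 * Return true if pfn lies in [start_pfn, start_pfn + nr_pages).
 * The comparison works on page frame numbers only, so it stays valid even
 * when the struct page arrays of neighbouring sections live at unrelated
 * addresses. The subtraction is evaluated only after pfn >= start_pfn holds,
 * so it cannot wrap around.
 */
bool pfn_in_range(unsigned long pfn, unsigned long start_pfn,
                  unsigned long nr_pages)
{
    return pfn >= start_pfn && pfn - start_pfn < nr_pages;
}
```

Subtracting two `struct page` pointers gives a meaningful distance only when both belong to the same contiguous array, which is what flatmem and vmemmap provide and classic sparsemem does not.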
     

21 Jan, 2018

2 commits

  • Pull KVM fixes from Radim Krčmář:
    "ARM:
    - fix incorrect huge page mappings on systems using the contiguous
    hint for hugetlbfs
    - support alternative GICv4 init sequence
    - correctly implement the ARM SMCCC for HVC and SMC handling

    PPC:
    - add KVM IOCTL for reporting vulnerability and workaround status

    s390:
    - provide userspace interface for branch prediction changes in
    firmware

    x86:
    - use correct macros for bits"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: s390: wire up bpb feature
    KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
    KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()
    arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
    KVM: arm64: Fix GICv4 init when called from vgic_its_create
    KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2

    Linus Torvalds
     
  • The new firmware interfaces for branch prediction behaviour changes
    are transparently available for the guest. Nevertheless, there is
    new state attached that should be migrated and properly reset.
    Provide a mechanism for handling reset, migration and VSIE.

    Signed-off-by: Christian Borntraeger
    Reviewed-by: David Hildenbrand
    Reviewed-by: Cornelia Huck
    [Changed capability number to 152. - Radim]
    Signed-off-by: Radim Krčmář

    Christian Borntraeger
     

20 Jan, 2018

1 commit

  • Without this patch, I drown in a sea of unknown attribute warnings

    Link: http://lkml.kernel.org/r/20180117024539.27354-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

19 Jan, 2018

1 commit

  • This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
    information about the underlying machine's level of vulnerability
    to the recently announced vulnerabilities CVE-2017-5715,
    CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
    instructions to assist software to work around the vulnerabilities.

    The ioctl returns two u64 words describing characteristics of the
    CPU and required software behaviour respectively, plus two mask
    words which indicate which bits have been filled in by the kernel,
    for extensibility. The bit definitions are the same as for the
    new H_GET_CPU_CHARACTERISTICS hypercall.

    There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
    indicates whether the new ioctl is available.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
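
How userspace might interpret the value and mask words can be sketched as below. This is a hedged illustration: `struct cpu_char_sketch` is a local mirror of the four u64 words described in the message (its field names are assumptions), the bit values in the usage are made up, and no real ioctl is issued.

```c
#include <stdint.h>

/* Local mirror of the two value words and two mask words; names assumed. */
struct cpu_char_sketch {
    uint64_t character;
    uint64_t behaviour;
    uint64_t character_mask;
    uint64_t behaviour_mask;
};

enum bit_state { BIT_UNKNOWN = -1, BIT_CLEAR = 0, BIT_SET = 1 };

/*
 * A bit is meaningful only if the kernel marked it as filled in via the
 * mask word. Bits outside the mask are reported as unknown, not as clear.
 */
enum bit_state query_bit(uint64_t value, uint64_t mask, uint64_t bit)
{
    if (!(mask & bit))
        return BIT_UNKNOWN;
    return (value & bit) ? BIT_SET : BIT_CLEAR;
}
```

Keeping "unknown" distinct from "clear" is what makes the interface extensible: an older kernel that does not know a newer bit leaves it out of the mask instead of reporting it as zero.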
     

18 Jan, 2018

2 commits

  • Pull scheduler fix from Ingo Molnar:
    "A delayacct statistics correctness fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    delayacct: Account blkio completion on the correct task

    Linus Torvalds
     
  • Pull x86 pti bits and fixes from Thomas Gleixner:
    "This last update contains:

    - An objtool fix to prevent a segfault with the gold linker by
    changing the invocation order. That's not just for gold, it's a
    general robustness improvement.

    - An improved error message for objtool which spares tearing hairs.

    - Make KASAN fail loudly if there is not enough memory instead of
    oopsing at some random place later

    - RSB fill on context switch to prevent RSB underflow and speculation
    through other units.

    - Make the retpoline/RSB functionality work reliably for both Intel
    and AMD

    - Add retpoline to the module version magic so mismatch can be
    detected

    - A small (non-fix) update for cpufeatures which prevents cpu feature
    clashing for the upcoming extra mitigation bits to ease
    backporting"

    * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    module: Add retpoline tag to VERMAGIC
    x86/cpufeature: Move processor tracing out of scattered features
    objtool: Improve error message for bad file argument
    objtool: Fix seg fault with gold linker
    x86/retpoline: Add LFENCE to the retpoline/RSB filling RSB macros
    x86/retpoline: Fill RSB on context switch for affected CPUs
    x86/kasan: Panic if there is not enough memory to boot

    Linus Torvalds
     

17 Jan, 2018

4 commits

  • Add a marker for retpoline to the module VERMAGIC. This catches the case
    when a non RETPOLINE compiled module gets loaded into a retpoline kernel,
    making it insecure.

    It doesn't handle the case when retpoline has been runtime disabled. Even
    in this case the match of the retcompile status will be enforced. This
    implies that even with retpoline run time disabled all modules loaded need
    to be recompiled.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: David Woodhouse
    Cc: rusty@rustcorp.com.au
    Cc: arjan.van.de.ven@intel.com
    Cc: jeyu@kernel.org
    Cc: torvalds@linux-foundation.org
    Link: https://lkml.kernel.org/r/20180116205228.4890-1-andi@firstfloor.org

    Andi Kleen
     
  • Pull networking fixes from David Miller:

    1) Two read past end of buffer fixes in AF_KEY, from Eric Biggers.

    2) Memory leak in key_notify_policy(), from Steffen Klassert.

    3) Fix overflow with bpf arrays, from Daniel Borkmann.

    4) Fix RDMA regression with mlx5 due to mlx5 no longer using
    pci_irq_get_affinity(), from Saeed Mahameed.

    5) Missing RCU read locking in nl80211_send_iface() when it calls
    ieee80211_bss_get_ie(), from Dominik Brodowski.

    6) cfg80211 should check dev_set_name()'s return value, from Johannes
    Berg.

    7) Missing module license tag in 9p protocol, from Stephen Hemminger.

    8) Fix crash due to too small MTU in udp ipv6 sendmsg, from Mike
    Maloney.

    9) Fix endless loop in netlink extack code, from David Ahern.

    10) TLS socket layer sets inverted error codes, resulting in an endless
    loop. From Robert Hering.

    11) Revert openvswitch erspan tunnel support, it's mis-designed and we
    need to kill it before it goes into a real release. From William Tu.

    12) Fix lan78xx failures in full speed USB mode, from Yuiko Oshino.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits)
    net, sched: fix panic when updating miniq {b,q}stats
    qed: Fix potential use-after-free in qed_spq_post()
    nfp: use the correct index for link speed table
    lan78xx: Fix failure in USB Full Speed
    sctp: do not allow the v4 socket to bind a v4mapped v6 address
    sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
    sctp: reinit stream if stream outcnt has been change by sinit in sendmsg
    ibmvnic: Fix pending MAC address changes
    netlink: extack: avoid parenthesized string constant warning
    ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
    net: Allow neigh contructor functions ability to modify the primary_key
    sh_eth: fix dumping ARSTR
    Revert "openvswitch: Add erspan tunnel support."
    net/tls: Fix inverted error codes to avoid endless loop
    ipv6: ip6_make_skb() needs to clear cork.base.dst
    sctp: avoid compiler warning on implicit fallthru
    net: ipv4: Make "ip route get" match iif lo rules again.
    netlink: extack needs to be reset each time through loop
    tipc: fix a memory leak in tipc_nl_node_get_link()
    ipv6: fix udpv6 sendmsg crash caused by too small MTU
    ...

    Linus Torvalds
     
  • While working on fixing another bug, I ran into the following panic
    on arm64 by simply attaching clsact qdisc, adding a filter and running
    traffic on ingress to it:

    [...]
    [ 178.188591] Unable to handle kernel read from unreadable memory at virtual address 810fb501f000
    [ 178.197314] Mem abort info:
    [ 178.200121] ESR = 0x96000004
    [ 178.203168] Exception class = DABT (current EL), IL = 32 bits
    [ 178.209095] SET = 0, FnV = 0
    [ 178.212157] EA = 0, S1PTW = 0
    [ 178.215288] Data abort info:
    [ 178.218175] ISV = 0, ISS = 0x00000004
    [ 178.222019] CM = 0, WnR = 0
    [ 178.224997] user pgtable: 4k pages, 48-bit VAs, pgd = 0000000023cb3f33
    [ 178.231531] [0000810fb501f000] *pgd=0000000000000000
    [ 178.236508] Internal error: Oops: 96000004 [#1] SMP
    [...]
    [ 178.311855] CPU: 73 PID: 2497 Comm: ping Tainted: G W 4.15.0-rc7+ #5
    [ 178.319413] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
    [ 178.326887] pstate: 60400005 (nZCv daif +PAN -UAO)
    [ 178.331685] pc : __netif_receive_skb_core+0x49c/0xac8
    [ 178.336728] lr : __netif_receive_skb+0x28/0x78
    [ 178.341161] sp : ffff00002344b750
    [ 178.344465] x29: ffff00002344b750 x28: ffff810fbdfd0580
    [ 178.349769] x27: 0000000000000000 x26: ffff000009378000
    [...]
    [ 178.418715] x1 : 0000000000000054 x0 : 0000000000000000
    [ 178.424020] Process ping (pid: 2497, stack limit = 0x000000009f0a3ff4)
    [ 178.430537] Call trace:
    [ 178.432976] __netif_receive_skb_core+0x49c/0xac8
    [ 178.437670] __netif_receive_skb+0x28/0x78
    [ 178.441757] process_backlog+0x9c/0x160
    [ 178.445584] net_rx_action+0x2f8/0x3f0
    [...]

    Reason is that sch_ingress and sch_clsact are doing mini_qdisc_pair_init()
    which sets up miniq pointers to cpu_{b,q}stats from the underlying qdisc.
    Problem is that this cannot work since they are actually set up right after
    the qdisc ->init() callback in qdisc_create(), so first packet going into
    sch_handle_ingress() tries to call mini_qdisc_bstats_cpu_update() and we
    therefore panic.

    In order to fix this, allocation of {b,q}stats needs to happen before we
    call into ->init(). In net-next, there's already such option through commit
    d59f5ffa59d8 ("net: sched: a dflt qdisc may be used with per cpu stats").
    However, the bug needs to be fixed in net still for 4.15. Thus, include
    these bits to reduce any merge churn and reuse the static_flags field to
    set TCQ_F_CPUSTATS, and remove the allocation from qdisc_create() since
    there is no other user left. Prashant Bhole ran into the same issue but
    for net-next, thus adding him below as well as co-author. Same issue was
    also reported by Sandipan Das when using bcc.

    Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath")
    Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2018-January/001190.html
    Reported-by: Sandipan Das
    Co-authored-by: Prashant Bhole
    Co-authored-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    More fixes:
    * hwsim:
    - properly flush deletion works at module unload
    - validate # of channels passed from userspace
    * cfg80211:
    - fix RCU locking regression
    - initialize on-stack channel data for nl80211 event
    - check dev_set_name() return value
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     

16 Jan, 2018

6 commits

  • Before commit:

    e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

    delayacct_blkio_end() was called after context-switching into the task which
    completed I/O.

    This resulted in double counting: the task would account a delay both waiting
    for I/O and for time spent in the runqueue.

    With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
    In ttwu, we have not yet context-switched. This is more correct, in that
    the delay accounting ends when the I/O is complete.

    But delayacct_blkio_end() relies on 'get_current()', and we have not yet
    context-switched into the task whose I/O completed. This results in the
    wrong task having its delay accounting statistics updated.

    Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
    so that it can update the statistics of the correct task.

    Signed-off-by: Josh Snyder
    Acked-by: Tejun Heo
    Acked-by: Balbir Singh
    Cc:
    Cc: Brendan Gregg
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-block@vger.kernel.org
    Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
    Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
    Signed-off-by: Ingo Molnar

    Josh Snyder
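
The accounting bug and its fix can be sketched in plain C. This is a simplified model, not scheduler code: `struct task`, `current_task`, `blkio_end_on_current` and `blkio_end_on_task` are hypothetical names, and timestamps are plain integers.

```c
/* Hypothetical, minimal stand-in for a task with block I/O delay stats. */
struct task {
    const char *name;
    unsigned long blkio_delay;  /* accumulated delay */
    unsigned long blkio_start;  /* when the task started waiting for I/O */
};

/* Stand-in for "the task currently running on this CPU". */
struct task *current_task;

/* Old behaviour: charges whichever task happens to be running the wakeup. */
void blkio_end_on_current(unsigned long now)
{
    current_task->blkio_delay += now - current_task->blkio_start;
}

/* Fixed behaviour: the caller passes the task whose I/O completed. */
void blkio_end_on_task(struct task *p, unsigned long now)
{
    p->blkio_delay += now - p->blkio_start;
}
```

At wakeup time the running task is the waker, not the sleeper, so only the variant that takes an explicit task pointer updates the right statistics.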
     
  • NL_SET_ERR_MSG() and NL_SET_ERR_MSG_ATTR() lead to the following warning
    in newer versions of gcc:
    warning: array initialized from parenthesized string constant

    Just remove the parentheses, they're not needed in this context anyway,
    since there can be no operator precedence issues or similar.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
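
The shape of the change can be sketched with a small macro. This is an illustration only: `struct extack_sketch`, `SET_ERR_MSG_SKETCH` and `reject_request` are hypothetical names, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for an extended-ack object carrying a message. */
struct extack_sketch {
    const char *msg;
};

/*
 * The array initializer uses the string argument directly. Writing it as
 * "= (text)" is the parenthesized form the commit message says newer gcc
 * warns about; a string literal needs no protective parentheses here.
 */
#define SET_ERR_MSG_SKETCH(extack_ptr, text) do {    \
    static const char msg_buf[] = text;              \
    struct extack_sketch *e_ = (extack_ptr);         \
    if (e_)                                          \
        e_->msg = msg_buf;                           \
} while (0)

/* Example caller: records a fixed message if an extack was supplied. */
void reject_request(struct extack_sketch *extack)
{
    SET_ERR_MSG_SKETCH(extack, "invalid attribute");
}
```

The pointer argument keeps its parentheses because it is an arbitrary expression; only the string initializer drops them.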
     
  • Map all lookup neigh keys to INADDR_ANY for loopback/point-to-point devices
    to avoid making an entry for every remote ip the device needs to talk to.

    This used to be the old behavior but became broken in a263b3093641f
    (ipv4: Make neigh lookups directly in output packet path) and was later
    removed in 0bb4087cbec0 (ipv4: Fix neigh lookup keying over
    loopback/point-to-point devices) because it was broken.

    Signed-off-by: Jim Westfall
    Signed-off-by: David S. Miller

    Jim Westfall
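
The key mapping can be sketched as a small function. This is a minimal illustration with locally defined constants (`SKETCH_IFF_LOOPBACK`, `SKETCH_IFF_POINTOPOINT`, `SKETCH_INADDR_ANY`) and a hypothetical function name; it does not reproduce the kernel's neighbour code.

```c
#include <stdint.h>

/* Locally defined flag values for the sketch; treat them as assumptions. */
#define SKETCH_IFF_LOOPBACK     0x8u
#define SKETCH_IFF_POINTOPOINT  0x10u
#define SKETCH_INADDR_ANY       0u

/*
 * On loopback and point-to-point devices every next hop shares a single
 * neighbour entry keyed by INADDR_ANY. Other devices key by the next-hop
 * address itself.
 */
uint32_t neigh_lookup_key(uint32_t dev_flags, uint32_t next_hop)
{
    if (dev_flags & (SKETCH_IFF_LOOPBACK | SKETCH_IFF_POINTOPOINT))
        return SKETCH_INADDR_ANY;
    return next_hop;
}
```

With this mapping a point-to-point device talking to many remote addresses still creates only one entry.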
     
  • This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.

    The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
    as a nested attribute to support all ERSPAN v1 and v2's fields.
    The current attr is a be32 supporting only one field. Thus, this
    patch reverts it and later patch will redo it using nested attr.

    Signed-off-by: William Tu
    Cc: Jiri Benc
    Cc: Pravin Shelar
    Acked-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    William Tu
     
  • sendfile() calls can hang endlessly when using Kernel TLS if a socket error occurs.
    Socket error codes must be inverted by Kernel TLS before returning because
    they are stored with positive sign. If returned non-inverted they are
    interpreted as number of bytes sent, causing endless looping of the
    splice mechanic behind sendfile().

    Signed-off-by: Robert Hering
    Signed-off-by: David S. Miller

    r.hering@avm.de
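
The sign convention at issue can be sketched as follows. This is a simplified model: `struct sock_sketch`, `send_chunk_buggy`, `send_chunk_fixed` and `looks_like_progress` are hypothetical names, and the caller is reduced to a single check.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical socket stand-in: sk_err holds a positive errno, 0 if none. */
struct sock_sketch {
    int sk_err;
};

/* Buggy variant: returns the stored error with its positive sign. */
int send_chunk_buggy(const struct sock_sketch *sk, int len)
{
    if (sk->sk_err)
        return sk->sk_err;
    return len;
}

/* Fixed variant: negates the stored error before returning it. */
int send_chunk_fixed(const struct sock_sketch *sk, int len)
{
    if (sk->sk_err)
        return -sk->sk_err;
    return len;
}

/* A caller that treats any positive return value as "bytes were sent". */
bool looks_like_progress(int ret)
{
    return ret > 0;
}
```

With the buggy variant a pending `EPIPE` is read as a positive byte count, so the caller believes it is making progress and never sees the failure.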
     
  • This explains why the net usage of __ptr_ring_peek
    is actually ok without locks.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     

15 Jan, 2018

2 commits

  • When creating a new radio on the fly, hwsim allows this
    to be done with an arbitrary number of channels, but
    cfg80211 only supports a limited number of simultaneous
    channels, leading to a warning.

    Fix this by validating the number - this requires moving
    the define for the maximum out to a visible header file.

    Reported-by: syzbot+8dd9051ff19940290931@syzkaller.appspotmail.com
    Fixes: b59ec8dd4394 ("mac80211_hwsim: fix number of channels in interface combinations")
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Pull x86 pti updates from Thomas Gleixner:
    "This contains:

    - a PTI bugfix to avoid setting reserved CR3 bits when PCID is
    disabled. This seems to cause issues on a virtual machine at least
    and is incorrect according to the AMD manual.

    - a PTI bugfix which disables the perf BTS facility if PTI is
    enabled. The BTS AUX buffer is not globally visible and causes the
    CPU to fault when the mapping disappears on switching CR3 to user
    space. A full fix which restores BTS on PTI is non trivial and will
    be worked on.

    - PTI bugfixes for EFI and trusted boot which make sure that the user
    space visible page table entries have the NX bit cleared

    - removal of dead code in the PTI pagetable setup functions

    - add PTI documentation

    - add a selftest for vsyscall to verify that the kernel actually
    implements what it advertises.

    - a sysfs interface to expose vulnerability and mitigation
    information so there is a coherent way for users to retrieve the
    status.

    - the initial spectre_v2 mitigations, aka retpoline:

    + The necessary ASM thunk and compiler support

    + The ASM variants of retpoline and the conversion of affected ASM
    code

    + Make LFENCE serializing on AMD so it can be used as speculation
    trap

    + The RSB fill after vmexit

    - initial objtool support for retpoline

    As I said in the status mail this is the most of the set of patches
    which should go into 4.15 except two straight forward patches still on
    hold:

    - the retpoline add on of LFENCE which waits for ACKs

    - the RSB fill after context switch

    Both should be ready to go early next week and with that we'll have
    covered the major holes of spectre_v2 and go back to normality"

    * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
    x86,perf: Disable intel_bts when PTI
    security/Kconfig: Correct the Documentation reference for PTI
    x86/pti: Fix !PCID and sanitize defines
    selftests/x86: Add test_vsyscall
    x86/retpoline: Fill return stack buffer on vmexit
    x86/retpoline/irq32: Convert assembler indirect jumps
    x86/retpoline/checksum32: Convert assembler indirect jumps
    x86/retpoline/xen: Convert Xen hypercall indirect jumps
    x86/retpoline/hyperv: Convert assembler indirect jumps
    x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
    x86/retpoline/entry: Convert entry assembler indirect jumps
    x86/retpoline/crypto: Convert crypto assembler indirect jumps
    x86/spectre: Add boot time option to select Spectre v2 mitigation
    x86/retpoline: Add initial retpoline support
    objtool: Allow alternatives to be ignored
    objtool: Detect jumps to retpoline thunks
    x86/pti: Make unpoison of pgd for trusted boot work for real
    x86/alternatives: Fix optimize_nops() checking
    sysfs/cpu: Fix typos in vulnerability documentation
    x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
    ...

    Linus Torvalds
     

14 Jan, 2018

2 commits

  • Merge misc fixlets from Andrew Morton:
    "4 fixes"

    * emailed patches from Andrew Morton:
    tools/objtool/Makefile: don't assume sync-check.sh is executable
    kdump: write correct address of mem_section into vmcoreinfo
    kmemleak: allow to coexist with fault injection
    MAINTAINERS, nilfs2: change project home URLs

    Linus Torvalds
     
  • Depending on configuration mem_section can now be an array or a pointer
    to an array allocated dynamically. In most cases, we can continue to
    refer to it as 'mem_section' regardless of what it is.

    But there's one exception: '&mem_section' means "address of the array"
    if mem_section is an array, but if mem_section is a pointer, it would
    mean "address of the pointer".

    We've stumbled on this in the kdump code. VMCOREINFO_SYMBOL(mem_section)
    writes the address of the pointer into vmcoreinfo, not the array as we wanted.

    Let's introduce VMCOREINFO_SYMBOL_ARRAY() that would handle the
    situation correctly for both cases.

    Link: http://lkml.kernel.org/r/20180112162532.35896-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
    Acked-by: Baoquan He
    Acked-by: Dave Young
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
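
The array-versus-pointer difference can be demonstrated in standalone C. This is a sketch with hypothetical names (`struct section_sketch`, `SYMBOL_ADDR`, `SYMBOL_ARRAY_ADDR`); the two macros only model the idea of taking `&name` versus using the decayed `name`.

```c
#include <stdint.h>

struct section_sketch {
    int dummy;
};

/* Flavour 1: the symbol is a real array. */
struct section_sketch section_array[4];

/* Flavour 2: the symbol is a pointer to dynamically provided storage. */
struct section_sketch backing_store[4];
struct section_sketch *section_ptr = backing_store;

/* Takes the address of the symbol itself. */
#define SYMBOL_ADDR(name)        ((uintptr_t)&(name))

/*
 * Uses the value the name evaluates to: an array decays to the address of
 * its first element, a pointer yields the address it points at.
 */
#define SYMBOL_ARRAY_ADDR(name)  ((uintptr_t)(name))
```

For the array flavour both macros give the same address. For the pointer flavour `SYMBOL_ADDR` yields the location of the pointer variable, while `SYMBOL_ARRAY_ADDR` yields the start of the data, which is what a dump tool needs.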
     

13 Jan, 2018

1 commit


12 Jan, 2018

2 commits

  • mlx5_get_vector_affinity used to call pci_irq_get_affinity and after
    reverting the patch that sets the device affinity via PCI_IRQ_AFFINITY
    API, calling pci_irq_get_affinity becomes useless and it breaks RDMA
    mlx5 users. To fix this, this patch provides an alternative way to
    retrieve IRQ vector affinity using legacy IRQ API, following
    smp_affinity read procfs implementation.

    Fixes: 231243c82793 ("Revert mlx5: move affinity hints assignments to generic code")
    Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
    Cc: Sagi Grimberg
    Signed-off-by: Saeed Mahameed

    Saeed Mahameed
     
  • There are systems platform information management interfaces (such as
    HOST2BMC) for which we cannot disable local loopback multicast traffic.

    Separate disable_local_lb_mc and disable_local_lb_uc capability bits so
    driver will not disable multicast loopback traffic if not supported.
    (It is expected that Firmware will not set disable_local_lb_mc if
    HOST2BMC is running for example.)

    Function mlx5_nic_vport_update_local_lb will make a best effort to
    disable/enable UC/MC loopback traffic and return success only if it
    succeeded in changing everything allowed by Firmware.

    Adapt mlx5_ib and mlx5e to support the new cap bits.

    Fixes: 2c43c5a036be ("net/mlx5e: Enable local loopback in loopback selftest")
    Fixes: c85023e153e3 ("IB/mlx5: Add raw ethernet local loopback support")
    Fixes: bded747bb432 ("net/mlx5: Add raw ethernet local loopback firmware command")
    Signed-off-by: Eran Ben Elisha
    Cc: kernel-team@fb.com
    Signed-off-by: Saeed Mahameed

    Eran Ben Elisha
     

11 Jan, 2018

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2018-01-09

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) Prevent out-of-bounds speculation in BPF maps by masking the
    index after bounds checks in order to fix spectre v1, and
    add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
    removing the BPF interpreter from the kernel in favor of
    JIT-only mode to make spectre v2 harder, from Alexei.

    2) Remove false sharing of map refcount with max_entries which
    was used in spectre v1, from Daniel.

    3) Add a missing NULL psock check in sockmap in order to fix
    a race, from John.

    4) Fix test_align BPF selftest case since a recent change in
    verifier rejects the bit-wise arithmetic on pointers
    earlier but test_align update was missing, from Alexei.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jan, 2018

1 commit

  • In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds
    speculation"), also change the layout of struct bpf_map such that
    false sharing of fast-path members like max_entries is avoided
    when the map's reference counter is altered. To that end, force
    them into separate cachelines.

    pahole dump after change:

    struct bpf_map {
            const struct bpf_map_ops * ops;             /*     0     8 */
            struct bpf_map *           inner_map_meta;  /*     8     8 */
            void *                     security;        /*    16     8 */
            enum bpf_map_type          map_type;        /*    24     4 */
            u32                        key_size;        /*    28     4 */
            u32                        value_size;      /*    32     4 */
            u32                        max_entries;     /*    36     4 */
            u32                        map_flags;       /*    40     4 */
            u32                        pages;           /*    44     4 */
            u32                        id;              /*    48     4 */
            int                        numa_node;       /*    52     4 */
            bool                       unpriv_array;    /*    56     1 */

            /* XXX 7 bytes hole, try to pack */

            /* --- cacheline 1 boundary (64 bytes) --- */
            struct user_struct *       user;            /*    64     8 */
            atomic_t                   refcnt;          /*    72     4 */
            atomic_t                   usercnt;         /*    76     4 */
            struct work_struct         work;            /*    80    32 */
            char                       name[16];        /*   112    16 */
            /* --- cacheline 2 boundary (128 bytes) --- */

            /* size: 128, cachelines: 2, members: 17 */
            /* sum members: 121, holes: 1, sum holes: 7 */
    };

    Now all entries in the first cacheline are read-only throughout
    the lifetime of the map, set up once during map creation. Overall
    struct size and number of cachelines don't change from the
    reordering. struct bpf_map is usually the first member, embedded
    in the map structs of specific map implementations, so also keep
    those members from sitting at the end, where they could potentially
    share a cacheline with the first map values, e.g. in the array,
    since remote CPUs could trigger map updates just as well for those
    (easily dirtying members like max_entries intentionally as well)
    while having subsequent values in cache.

    Quoting from Google's Project Zero blog [1]:

    Additionally, at least on the Intel machine on which this was
    tested, bouncing modified cache lines between cores is slow,
    apparently because the MESI protocol is used for cache coherence
    [8]. Changing the reference counter of an eBPF array on one
    physical CPU core causes the cache line containing the reference
    counter to be bounced over to that CPU core, making reads of the
    reference counter on all other CPU cores slow until the changed
    reference counter has been written back to memory. Because the
    length and the reference counter of an eBPF array are stored in
    the same cache line, this also means that changing the reference
    counter on one physical CPU core causes reads of the eBPF array's
    length to be slow on other physical CPU cores (intentional false
    sharing).

    While this doesn't 'control' the out-of-bounds speculation through
    masking the index as in commit b2157399cc98, triggering a manipulation
    of the map's reference counter is really trivial, so let's not allow
    it to easily affect max_entries.

    Splitting into separate cachelines also generally makes sense from
    a performance perspective, in that the fast path won't take a cache
    miss if the map gets pinned, reused in other progs, etc. from the
    control path, which also avoids unintentional false sharing.

    [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
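
    The layout idea in this commit can be sketched in portable C11. This is
    a reduced, hypothetical struct (struct demo_map, DEMO_CACHELINE and
    demo_cacheline_of() are illustration-only names), assuming a 64-byte
    cache line; C11 alignas stands in for whatever annotation the kernel
    patch itself uses.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64  /* assumed cache line size in bytes */

/* Hypothetical, reduced stand-in for a map header; not struct bpf_map. */
struct demo_map {
    /* Read-mostly fields, written once when the map is created. */
    uint32_t key_size;
    uint32_t value_size;
    uint32_t max_entries;

    /* Frequently written counters start on their own cache line. */
    alignas(DEMO_CACHELINE) int32_t refcnt;
    int32_t usercnt;
};

/* Index of the cache line that holds a given byte offset. */
static size_t demo_cacheline_of(size_t offset)
{
    return offset / DEMO_CACHELINE;
}

/* Compile-time check that the counters begin exactly at line 1. */
_Static_assert(offsetof(struct demo_map, refcnt) == DEMO_CACHELINE,
               "refcnt must start a new cache line");
```

    Because refcnt is forced to offset 64, bumping it dirties only the
    second line, and readers of max_entries on other CPUs keep their copy
    of the first line.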
     

09 Jan, 2018

4 commits

  • Pull networking fixes from David Miller:

    1) Frag and UDP handling fixes in i40e driver, from Amritha Nambiar and
    Alexander Duyck.

    2) Undo unintentional UAPI change in netfilter conntrack, from Florian
    Westphal.

    3) Revert a change to how error codes are returned from
    dev_get_valid_name(), it broke some apps.

    4) Cannot cache routes for ipv6 tunnels if the tunnel is ipv4/ipv6
    dual-stack. From Eli Cooper.

    5) Fix missed PMTU updates in geneve, from Xin Long.

    6) Cure double free in macvlan, from Gao Feng.

    7) Fix heap out-of-bounds write in rds_message_alloc_sgs(), from
    Mohamed Ghannam.

    8) FEC bug fixes from Fugang Duan (mis-accounting of dev_id, missed
    deferral of probe when the regulator is not ready yet).

    9) Missing DMA mapping error checks in 3c59x, from Neil Horman.

    10) Turn off Broadcom tags for some b53 switches, from Florian Fainelli.

    11) Fix OOPS when get_target_net() is passed an SKB whose NETLINK_CB()
    isn't initialized. From Andrei Vagin.

    12) Fix crashes in fib6_add(), from Wei Wang.

    13) PMTU bug fixes in SCTP from Marcelo Ricardo Leitner.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
    sh_eth: fix TXALCR1 offsets
    mdio-sun4i: Fix a memory leak
    phylink: mark expected switch fall-throughs in phylink_mii_ioctl
    sctp: fix the handling of ICMP Frag Needed for too small MTUs
    sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled
    xen-netfront: enable device after manual module load
    bnxt_en: Fix the 'Invalid VF' id check in bnxt_vf_ndo_prep routine.
    bnxt_en: Fix population of flow_type in bnxt_hwrm_cfa_flow_alloc()
    sh_eth: fix SH7757 GEther initialization
    net: fec: free/restore resource in related probe error pathes
    uapi/if_ether.h: prevent redefinition of struct ethhdr
    ipv6: fix general protection fault in fib6_add()
    RDS: null pointer dereference in rds_atomic_free_op
    sh_eth: fix TSU resource handling
    net: stmmac: enable EEE in MII, GMII or RGMII only
    rtnetlink: give a user socket to get_target_net()
    MAINTAINERS: Update my email address.
    can: ems_usb: improve error reporting for error warning and error passive
    can: flex_can: Correct the checking for frame length in flexcan_start_xmit()
    can: gs_usb: fix return value of the "set_bittiming" callback
    ...

    Linus Torvalds
     
  • Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
    memory accesses under a bounds check may be speculated even if the
    bounds check fails, providing a primitive for building a side channel.

    To avoid leaking kernel data, round up array-based maps and mask the
    index after the bounds check, so that a speculated load with an
    out-of-bounds index will load either a valid value from the array or
    zero from the padded area.

    Unconditionally mask the index for all array types, even when
    max_entries is not rounded up to a power of 2 for the root user.
    When a map is created by an unpriv user, generate a sequence of bpf
    insns that includes an AND operation to make sure that JITed code
    includes the same 'index & index_mask' operation.

    If a prog_array map is created by an unpriv user, replace
        bpf_tail_call(ctx, map, index);
    with
        if (index < max_entries) {
            index &= map->index_mask;
            bpf_tail_call(ctx, map, index);
        }
    (along with roundup to power of 2) to prevent out-of-bounds speculation.
    There is a secondary redundant 'if (index >= max_entries)' check in the
    interpreter and in all JITs, but they can be optimized later if necessary.

    Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
    cannot be used by unpriv, so no changes there.

    That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
    all architectures with and without JIT.

    v2->v3:
    Daniel noticed that attack potentially can be crafted via syscall commands
    without loading the program, so add masking to those paths as well.

    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
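
    A minimal user-space model of the "round up and mask after the bounds
    check" pattern from this commit. struct demo_array, DEMO_SLOTS and the
    demo_* functions are hypothetical names, not the kernel's bpf_array
    code, and the sketch only shows the arithmetic; whether a compiler
    keeps the AND in place is outside what it demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_SLOTS 8u  /* backing storage size, itself a power of two */

/* Hypothetical array map model. */
struct demo_array {
    uint32_t max_entries;         /* logical size checked by lookups */
    uint32_t index_mask;          /* rounded-up power of two, minus one */
    uint64_t values[DEMO_SLOTS];  /* zero-padded past max_entries */
};

/* Round n up to the next power of two, for 1 <= n <= DEMO_SLOTS. */
static uint32_t demo_roundup_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Returns 0 on success, -1 if max_entries does not fit the storage. */
static int demo_array_init(struct demo_array *a, uint32_t max_entries)
{
    if (max_entries == 0 || max_entries > DEMO_SLOTS)
        return -1;
    a->max_entries = max_entries;
    a->index_mask = demo_roundup_pow2(max_entries) - 1;
    for (size_t i = 0; i < DEMO_SLOTS; i++)
        a->values[i] = 0;
    return 0;
}

/*
 * Bounds check first, then mask. Even if the branch is mispredicted,
 * the masked index stays inside the padded storage.
 */
static uint64_t *demo_array_lookup(struct demo_array *a, uint32_t index)
{
    if (index >= a->max_entries)
        return NULL;
    return &a->values[index & a->index_mask];
}
```

    For max_entries = 5 the mask is 7, so any index, valid or not, can
    only ever address one of the eight zero-initialised slots.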
     
  • syzbot reported a hang involving SCTP, on which it kept flooding dmesg
    with the message:
    [ 246.742374] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
    low, using default minimum of 512

    That happened because whenever SCTP hits an ICMP Frag Needed, it tries
    to adjust to the new MTU and triggers an immediate retransmission. But
    it didn't consider the fact that MTUs smaller than the SCTP minimum MTU
    allowed (512) would not cause the PMTU to change, and issued the
    retransmission anyway (thus leading to another ICMP Frag Needed, and so
    on).

    As IPv4 (ip_rt_min_pmtu=556) and IPv6 (IPV6_MIN_MTU=1280) minimum MTU
    are higher than that, sctp_transport_update_pmtu() is changed to
    re-fetch the PMTU that got set after our request, and with that, detect
    if there was an actual change or not.

    The fix, thus, skips the immediate retransmission if the received ICMP
    resulted in no change, in the hope that SCTP will select another path.

    Note: The value being used for the minimum MTU (512,
    SCTP_DEFAULT_MINSEGMENT) is not right and instead it should be (576,
    SCTP_MIN_PMTU), but such change belongs to another patch.

    Changes from v1:
    - do not disable PMTU discovery, in the light of commit
    06ad391919b2 ("[SCTP] Don't disable PMTU discovery when mtu is small")
    and as suggested by Xin Long.
    - changed the way to break the rtx loop by detecting if the icmp
    resulted in a change or not
    Changes from v2:
    none

    See-also: https://lkml.org/lkml/2017/12/22/811
    Reported-by: syzbot
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
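
    The loop-breaking logic described above can be sketched as follows.
    This is a simplified model, not the SCTP implementation: struct
    demo_transport, DEMO_MIN_PMTU and the demo_* functions are hypothetical,
    and the clamp stands in for re-fetching the PMTU that the routing
    layer actually stored.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MIN_PMTU 512u  /* the 512-byte minimum quoted in the log */

/* Hypothetical transport state. */
struct demo_transport {
    uint32_t pathmtu;          /* PMTU currently in effect */
    unsigned int retransmits;  /* immediate retransmissions triggered */
};

/*
 * Apply a reported PMTU, clamped to the minimum, and report whether
 * the effective value actually changed.
 */
static bool demo_update_pmtu(struct demo_transport *t, uint32_t reported)
{
    uint32_t old = t->pathmtu;

    t->pathmtu = reported < DEMO_MIN_PMTU ? DEMO_MIN_PMTU : reported;
    return t->pathmtu != old;
}

/* ICMP Frag Needed handler: retransmit only when the PMTU changed. */
static void demo_icmp_frag_needed(struct demo_transport *t, uint32_t reported)
{
    if (!demo_update_pmtu(t, reported))
        return;  /* no change: a retransmit would hit the same ICMP */
    t->retransmits++;
}
```

    A second report of 508 after the PMTU has already been clamped to 512
    changes nothing, so no further retransmission is issued and the flood
    stops.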
     
  • There are two leftover cross-release facilities:

    - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
    - the complete_release_commit() callback (NOP as well)

    Remove them.

    Cc: David Sterba
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar