14 Dec, 2015

2 commits

  • Jan Stancek reported that I wrecked things for him by fixing things for
    Vladimir :/

    His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
    should not be possible, however my previous patch made this possible by
    unconditionally checking signal_pending().

    We cannot use current->state as was done previously, because it can
    already have changed by the instruction after the store to that
    variable. We must instead pass the initial state along and use that.

    Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
    Reported-by: Jan Stancek
    Reported-by: Chris Mason
    Tested-by: Jan Stancek
    Tested-by: Vladimir Murzin
    Tested-by: Chris Mason
    Reviewed-by: Paul Turner
    Cc: Ingo Molnar
    Cc: tglx@linutronix.de
    Cc: Oleg Nesterov
    Cc: hpa@zytor.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
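
The state-passing idea in the commit above can be sketched in user space. This is only an illustration under stated assumptions, not the kernel code: `wait_mode`, `bit_wait_unconditional()` and `bit_wait_with_mode()` are hypothetical names, and a plain `bool` stands in for `signal_pending()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the states a waiter can sleep in. */
enum wait_mode { WAIT_UNINTERRUPTIBLE, WAIT_INTERRUPTIBLE };

/* Behaviour described as broken: the signal check ignores the wait mode,
 * so even an uninterruptible waiter can see -EINTR. */
int bit_wait_unconditional(bool signal_is_pending)
{
    return signal_is_pending ? -EINTR : 0;
}

/* Behaviour described as the fix: the caller passes along the mode it
 * started the wait with, and only an interruptible wait reports -EINTR. */
int bit_wait_with_mode(enum wait_mode initial_mode, bool signal_is_pending)
{
    if (initial_mode == WAIT_INTERRUPTIBLE && signal_is_pending)
        return -EINTR;
    return 0;
}
```

Passing the mode as an argument avoids re-reading shared state that may already have changed by the time the check runs.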
     
  • Pull NFS client bugfix from Trond Myklebust:
    "SUNRPC: Fix a NFSv4.1 callback channel regression"

    * tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    SUNRPC: Fix callback channel

    Linus Torvalds
     

08 Dec, 2015

1 commit

  • The NFSv4.1 callback channel is currently broken: the received message
    keeps shrinking because the backchannel receive buffer size never gets
    reset.
    The easiest solution is to adjust the copied request rather than
    changing the receive buffer.

    Fixes: 38b7631fbe42 ("nfs4: limit callback decoding to received bytes")
    Cc: Benjamin Coddington
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

07 Dec, 2015

1 commit

  • The following commit which went into mainline through networking tree

    3b13758f51de ("cgroups: Allow dynamically changing net_classid")

    conflicts in net/core/netclassid_cgroup.c with the following pending
    fix in cgroup/for-4.4-fixes.

    1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

    The former separates out update_classid() from cgrp_attach() and
    updates it to walk all fds of all tasks in the target css so that it
    can be used from both migration and config change paths. The latter
    drops @css from cgrp_attach().

    Resolve the conflict by making cgrp_attach() call update_classid()
    with the css from the first task. We can revive @tset walking in
    cgrp_attach() but given that net_cls is v1 only where there always is
    only one target css during migration, this is fine.

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Cc: Nina Schiff

    Tejun Heo
     

04 Dec, 2015

9 commits

  • Pull networking fixes from David Miller:
    "A lot of Thanksgiving turkey leftovers accumulated, here goes:

    1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

    2) IDs for some new iwlwifi chips, from Oren Givon.

    3) Fix rtlwifi lockups on boot, from Larry Finger.

    4) Fix memory leak in fm10k, from Stephen Hemminger.

    5) We have a route leak in the ipv6 tunnel infrastructure, fix from
    Paolo Abeni.

    6) Fix buffer pointer handling in arm64 bpf JIT, from Zi Shen Lim.

    7) Wrong lockdep annotations in tcp md5 support, fix from Eric
    Dumazet.

    8) Work around some middle boxes which prevent proper handling of TCP
    Fast Open, from Yuchung Cheng.

    9) TCP repair can do huge kmalloc() requests, build paged SKBs
    instead. From Eric Dumazet.

    10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
    Borkmann.

    11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
    Nikolay Aleksandrov.

    12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
    Weikusat.

    13) Fix double free in VRF code, from Nikolay Aleksandrov.

    14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

    15) Fix ifup/ifdown crash in xgene driver, from Iyappan Subramanian.

    16) Fix clearing of persistent array maps in bpf, from Daniel
    Borkmann.

    17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
    early enough. From Eric Dumazet.

    18) Fix out of bounds accesses in bpf array implementation when
    updating elements, from Daniel Borkmann.

    19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
    Dumazet.

    20) When dumping proxy neigh entries, we have to accommodate NULL
    device pointers properly, from Konstantin Khlebnikov.

    21) SCTP doesn't release all ipv6 socket resources properly, fix from
    Eric Dumazet.

    22) Prevent underflows of sch->q.qlen for multiqueue packet
    schedulers, also from Eric Dumazet.

    23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
    Huang and Michael Chan.

    24) Don't actively scan radar channels, from Antonio Quartulli"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
    net: phy: reset only targeted phy
    bnxt_en: Setup uc_list mac filters after resetting the chip.
    bnxt_en: enforce proper storing of MAC address
    bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
    net: lpc_eth: remove irq > NR_IRQS check from probe()
    net_sched: fix qdisc_tree_decrease_qlen() races
    openvswitch: fix hangup on vxlan/gre/geneve device deletion
    ipv4: igmp: Allow removing groups from a removed interface
    ipv6: sctp: implement sctp_v6_destroy_sock()
    arm64: bpf: add 'store immediate' instruction
    ipv6: kill sk_dst_lock
    ipv6: sctp: add rcu protection around np->opt
    net/neighbour: fix crash at dumping device-agnostic proxy entries
    sctp: use GFP_USER for user-controlled kmalloc
    sctp: convert sack_needed and sack_generation to bits
    ipv6: add complete rcu protection around np->opt
    bpf: fix allocation warnings in bpf maps and integer overflow
    mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
    net: mvneta: enable setting custom TX IP checksum limit
    net: mvneta: fix error path for building skb
    ...

    Linus Torvalds
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    A small set of fixes for 4.4:
    * fix scanning in mac80211 to not actively scan radar
    channels (from Antonio)
    * fix uninitialized variable in remain-on-channel that
    could lead to treating frame TX as remain-on-channel
    and not sending the frame at all
    * remove NL80211_FEATURE_FULL_AP_CLIENT_STATE again, it
    was broken and needs more work, we'll enable it later
    * fix call_rcu() induced use-after-reset/free in mesh
    (that was suddenly causing issues in certain tests)
    * always request block-ack window size 64 as we found
    some APs will otherwise crash (really ...)
    * fix P2P-Device teardown sequence to avoid restarting
    with uninitialized data
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
    devices.

    One problem is that it updates sch->q.qlen and sch->qstats.drops
    on the mq/mqprio root qdisc, which it should not do. Daniele
    reported underflow errors:
    [ 681.774821] PAX: sch->q.qlen: 0 n: 1
    [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
    [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
    [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
    [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
    [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
    [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
    [ 681.774962] Call Trace:
    [ 681.774967] [] dump_stack+0x4c/0x7f
    [ 681.774970] [] report_size_overflow+0x34/0x50
    [ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
    [ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
    [ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
    [ 681.774980] [] __qdisc_run+0x4d/0x1d0
    [ 681.774983] [] net_tx_action+0xc2/0x160
    [ 681.774985] [] __do_softirq+0xf1/0x200
    [ 681.774987] [] run_ksoftirqd+0x1e/0x30
    [ 681.774989] [] smpboot_thread_fn+0x150/0x260
    [ 681.774991] [] ? sort_range+0x40/0x40
    [ 681.774992] [] kthread+0xe4/0x100
    [ 681.774994] [] ? kthread_worker_fn+0x170/0x170
    [ 681.774995] [] ret_from_fork+0x3e/0x70

    mq/mqprio have their own ways to report qlen/drops by folding stats on
    all their queues, with appropriate locking.

    A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
    without proper locking : concurrent qdisc updates could corrupt the list
    that qdisc_match_from_root() parses to find a qdisc given its handle.

    Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
    qdisc_tree_decrease_qlen() can use to abort its tree traversal
    as soon as it meets a child qdisc of mq/mqprio.

    The second problem can be fixed by RCU protection.
    Qdiscs are already freed after an RCU grace period, so qdisc_list_add()
    and qdisc_list_del() simply have to use the appropriate RCU list variants.

    A future patch will add a per struct netdev_queue list anchor, so that
    qdisc_tree_decrease_qlen() can have more efficient lookups.

    Reported-by: Daniele Fucini
    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
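
The early-abort traversal described in the commit above can be sketched as follows. This is a simplified user-space model, not the code in net/sched/sch_api.c: `struct qdisc_sketch`, `F_NOPARENT` and `tree_decrease_qlen()` are hypothetical stand-ins, parents are reached through a direct pointer instead of a handle lookup, and no locking or RCU is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define F_NOPARENT 0x1u /* hypothetical stand-in for TCQ_F_NOPARENT */

struct qdisc_sketch {
    struct qdisc_sketch *parent; /* NULL at the root */
    unsigned int flags;
    unsigned int qlen;
};

/* Propagate a drop of n packets from sch to its ancestors. The walk stops
 * at a qdisc flagged F_NOPARENT, so the root above it is never touched. */
void tree_decrease_qlen(struct qdisc_sketch *sch, unsigned int n)
{
    while (sch->parent != NULL) {
        if (sch->flags & F_NOPARENT)
            break;
        sch = sch->parent;
        sch->qlen -= n;
    }
}
```

With a root, a flagged child and a leaf below it, a drop reported by the leaf decrements the child's counter and leaves the root's counter alone, which is the underflow the report describes.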
     
  • Each openvswitch tunnel vport (vxlan, gre, geneve) holds a reference
    to the underlying tunnel device, but never releases it when that
    device is deleted.
    Deleting the underlying device via the ip tool causes the kernel to
    hang in the netdev_wait_allrefs() loop.
    This commit ensures that on device unregistration dp_detach_port_notify()
    is called for all vports that hold the device reference, properly
    releasing it.

    Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
    Fixes: b2acd1dc3949 ("openvswitch: Use regular GRE net_device instead of vport")
    Fixes: 6b001e682e90 ("openvswitch: Use Geneve device.")
    Signed-off-by: Paolo Abeni
    Acked-by: Flavio Leitner
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • When a multicast group is joined on a socket, a struct ip_mc_socklist
    is appended to the socket's mc_list containing information about the
    joined group.

    If the interface is hot unplugged, this entry becomes stale. Prior to
    commit 52ad353a5344f ("igmp: fix the problem when mc leave group") it
    was possible to remove the stale entry by performing an
    IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
    the interface. However, this fix enforces that the interface must
    still exist. Thus with time, the number of stale entries grows, until
    sysctl_igmp_max_memberships is reached and then it is not possible to
    join any more groups.

    The previous patch fixes an issue where an IP_DROP_MEMBERSHIP is
    performed without specifying the interface, either by ifindex or ip
    address. However here we do supply one of these. So loosen the
    restriction on device existence to only apply when the interface has
    not been specified. This then restores the ability to clean up the
    stale entries.

    Signed-off-by: Andrew Lunn
    Fixes: 52ad353a5344f ("igmp: fix the problem when mc leave group")
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Dmitry Vyukov reported a memory leak using IPV6 SCTP sockets.

    We need to call inet6_destroy_sock() to properly release
    inet6 specific fields.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Johan Hedberg says:

    ====================
    pull request: bluetooth 2015-12-01

    Here's a Bluetooth fix for the 4.4-rc series that fixes a memory leak of
    the Security Manager L2CAP channel that'll happen for every LE
    connection.

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • While testing the np->opt RCU conversion, I found that UDP/IPv6 was
    using a mixture of xchg() and sk_dst_lock to protect concurrent changes
    to sk->sk_dst_cache, leading to possible corruptions and crashes.

    ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
    way to fix the mess is to remove sk_dst_lock completely, as we did for
    IPv4.

    __ip6_dst_store() and ip6_dst_store() share the same implementation.

    Since sk_setup_caps() can be called with or without the socket lock
    held, we have to use sk_dst_set() instead of __sk_dst_set().

    Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
    in ip6_dst_store() before the sk_setup_caps(sk, dst) call.

    This is because ip6_dst_store() can be called from process context,
    without any lock held.

    As soon as the dst is installed in sk->sk_dst_cache, it can be freed
    by another cpu doing a concurrent ip6_dst_store().

    Dereferencing the dst before installing it is needed to make
    sure no use-after-free can trigger.

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
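
The ordering constraint in the commit above (read everything needed from the dst before publishing it) can be sketched with C11 atomics. This is a user-space illustration under stated assumptions: `struct dst_sketch`, `struct sock_dst_sketch` and `dst_store()` are hypothetical, and the old entry is simply freed where the kernel would drop a reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct dst_sketch {
    int cookie;
};

struct sock_dst_sketch {
    _Atomic(struct dst_sketch *) dst_cache; /* published route entry */
    int dst_cookie;                         /* copied from the entry */
};

/* Install dst as the cached entry. Every dereference of dst happens
 * before the exchange: once dst is published, a concurrent dst_store()
 * on another thread may swap it out and free it. */
void dst_store(struct sock_dst_sketch *sk, struct dst_sketch *dst)
{
    struct dst_sketch *old;

    sk->dst_cookie = dst->cookie;              /* read first */
    old = atomic_exchange(&sk->dst_cache, dst); /* then publish */
    free(old);                                 /* free(NULL) is a no-op */
}
```

Reversing the two statements would dereference `dst` after it became visible to other threads, which is the use-after-free window the message describes.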
     
  • This patch completes the work I did in commit 45f6fad84cc3
    ("ipv6: add complete rcu protection around np->opt"), as I missed
    the sctp part.

    This simply makes sure np->opt is used with proper RCU locking
    and accessors.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Dec, 2015

8 commits

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B, and A's processes should be moved to
    the former and B's processes to the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
    updating the cgroup_taskset iteration helpers to also return the
    destination css in addition to the task being migrated. All controllers
    are updated accordingly.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's a single source
    and destination and thus doesn't support v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
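
The accounting problem in the commit above comes down to charging each task against its own destination rather than one shared css. A minimal sketch, assuming hypothetical types (`struct css_sketch`, `struct migrating_task`) and a hypothetical `pids_attach_sketch()` that is far simpler than the real pids controller:

```c
#include <assert.h>
#include <stddef.h>

struct css_sketch {
    int nr_pids; /* number of tasks charged to this css */
};

/* One entry of the migration set: the iteration yields the destination
 * css together with the task, instead of assuming a single target. */
struct migrating_task {
    struct css_sketch *src;
    struct css_sketch *dst;
};

/* Move the charge for every task from its source to its own destination. */
void pids_attach_sketch(struct migrating_task *set, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        set[i].dst->nr_pids++;
        set[i].src->nr_pids--;
    }
}
```

For the P1/A/B hierarchy in the message, two tasks leaving P1 for A and B end with one charge on each child and none left on P1, so no counter goes negative.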
     
  • Proxy entries can have a NULL pointer to the net-device.

    Signed-off-by: Konstantin Khlebnikov
    Fixes: 84920c1420e2 ("net: Allow ipv6 proxies and arp proxies be shown with iproute2")
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     
  • Dmitry Vyukov reported that the user could trigger a kernel warning by
    using a large len value for getsockopt SCTP_GET_LOCAL_ADDRS, as that
    value directly affects the value used as a kmalloc() parameter.

    This patch thus switches the allocation flags for all user-controllable
    kmalloc sizes to GFP_USER to put some more restrictions on them, and also
    disables the warning, as it is not necessary.

    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • This patch addresses multiple problems :

    UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
    while socket is not locked : Other threads can change np->opt
    concurrently. Dmitry posted a syzkaller
    (http://github.com/google/syzkaller) program demonstrating a
    use-after-free.

    Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
    and dccp_v6_request_recv_sock() also need to use RCU protection
    to dereference np->opt once (before calling ipv6_dup_options())

    This patch adds full RCU protection to np->opt

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In the last change here, I neglected to update the cookie in one code
    path: when a mgmt-tx has no real cookie sent to userspace as it doesn't
    wait for a response, but is off-channel. The original code used the SKB
    pointer as the cookie and always assigned the cookie to the TX SKB in
    ieee80211_start_roc_work(), but my change turned this around and made
    the code rely on a valid cookie being passed in.

    Unfortunately, the off-channel no-wait TX path wasn't assigning one at
    all, resulting in an uninitialized stack value being used. This wasn't
    handed back to userspace as a cookie (since in the no-wait case there
    isn't a cookie), but it was tested for non-zero to distinguish between
    mgmt-tx and off-channel.

    Fix this by assigning a dummy non-zero cookie unconditionally, and get
    rid of a misleading comment and some dead code while at it. I'll clean
    up the ACK SKB handling separately later.

    Fixes: 3b79af973cf4 ("mac80211: stop using pointers as userspace cookies")
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • DFS channels should not be actively scanned as we can't be sure
    if we are allowed or not.

    If the current channel is in the DFS band, active scan might be
    performed after CSA, but we have no guarantee about other channels,
    therefore it is safer to prevent active scanning at all.

    Cc: stable@vger.kernel.org
    Signed-off-by: Antonio Quartulli
    Signed-off-by: Johannes Berg

    Antonio Quartulli
     
  • Interfaces are being initialized (setup) on addition,
    and torn down on removal.

    However, the p2p device is being torn down when stopped,
    resulting in the next p2p start operation being done
    on an uninitialized interface.

    Solve it by calling ieee80211_teardown_sdata() only
    on interface removal (for the non-netdev case).

    Signed-off-by: Eliad Peller
    Signed-off-by: Emmanuel Grumbach
    [squashed in fix to call teardown after unregister]
    Signed-off-by: Johannes Berg

    Eliad Peller
     
  • After 614732eaa12d, no refcount is maintained for the vport-vxlan module.
    This allows userspace to remove the module while vport-vxlan
    devices still exist, which leads to a later oops.

    v1 -> v2:
    - move vport 'owner' initialization in ovs_vport_ops_register()
    and make such function a macro

    Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

02 Dec, 2015

3 commits

  • Dmitry provided a syzkaller (http://github.com/google/syzkaller)
    program triggering a fault in sock_wake_async() when async IO is requested.

    Said program stressed af_unix sockets, but the issue is generic
    and should be addressed in core networking stack.

    The problem is that by the time sock_wake_async() is called,
    we should not access the @flags field of 'struct socket',
    as the inode containing this socket might be freed without
    further notice, and without RCU grace period.

    We already maintain an RCU protected structure, "struct socket_wq"
    so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
    is the safe route.

    It also reduces the number of cache lines needing dirtying, so it might
    provide a performance improvement anyway.

    In followup patches, we might move the remaining flags (SOCK_NOSPACE,
    SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and make 'struct socket'
    mostly read, so that it can be shared between cpus.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch is a cleanup to make the following patch easier to
    review.

    The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async().

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk), are added so that
    the following patch can change their implementation.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
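
The point of the two helpers is that callers name a bit and a socket, while the location of the flags word stays an implementation detail. A user-space sketch of the layout after the move, with hypothetical `_sketch` names, hypothetical bit numbers, and plain non-atomic bit operations where the kernel would use atomic ones:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers for the two async flags. */
enum { WQ_ASYNC_NOSPACE = 0, WQ_ASYNC_WAITDATA = 1 };

struct socket_wq_sketch {
    unsigned long flags; /* lives in the RCU-protected wait-queue object */
};

struct sock_with_wq {
    struct socket_wq_sketch *wq;
};

/* Callers go through these helpers, so moving the flags word from one
 * structure to another only changes the helper bodies. */
void sk_set_bit_sketch(int nr, struct sock_with_wq *sk)
{
    sk->wq->flags |= 1UL << nr;
}

void sk_clear_bit_sketch(int nr, struct sock_with_wq *sk)
{
    sk->wq->flags &= ~(1UL << nr);
}

bool sk_test_bit_sketch(int nr, const struct sock_with_wq *sk)
{
    return (sk->wq->flags >> nr) & 1UL;
}
```

`sk_test_bit_sketch()` is added here only to make the sketch checkable; the commit message itself names just the set and clear helpers.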
     
  • This reverts commit ab450605b35caa768ca33e86db9403229bf42be4.

    In IPv6, we cannot inherit the dst of the original dst. ndisc packets
    are IPv6 packets and may take another route than the original packet.

    This patch breaks the following scenario: a packet comes from eth0 and
    is forwarded through vxlan1. The encapsulated packet triggers an NS
    which cannot be sent because of the wrong route.

    CC: Jiri Benc
    CC: Thomas Graf
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

01 Dec, 2015

2 commits

  • Dmitry provided a syzkaller (http://github.com/google/syzkaller)
    generated program that triggers the WARNING at
    net/ipv4/tcp.c:1729 in tcp_recvmsg() :

    WARN_ON(tp->copied_seq != tp->rcv_nxt &&
    !(flags & (MSG_PEEK | MSG_TRUNC)));

    His program is specifically attempting a Cross SYN TCP exchange,
    that we support (for the pleasure of hackers ?), but it looks like we
    lack proper tp->copied_seq initialization.

    Thanks again Dmitry for your report and testing.

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
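
The condition quoted in the commit above can be read as a small predicate. The sketch below restates it in user space; `recvmsg_would_warn()` is a hypothetical helper, and the sequence numbers are passed in directly instead of being read from a TCP socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h> /* MSG_PEEK, MSG_TRUNC */

/* True when the quoted WARN_ON() condition holds: the copied sequence
 * differs from the next expected sequence, and the caller asked for
 * neither MSG_PEEK nor MSG_TRUNC. */
bool recvmsg_would_warn(uint32_t copied_seq, uint32_t rcv_nxt, int flags)
{
    return copied_seq != rcv_nxt && !(flags & (MSG_PEEK | MSG_TRUNC));
}
```

A `copied_seq` that was never initialized to match `rcv_nxt` makes the first operand true, which is how the cross-SYN case described above reaches the warning.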
     
  • sendpage did not care about credentials at all. This could lead to
    situations in which because of fd passing between processes we could
    append data to skbs with different scm data. It is illegal to splice those
    skbs together. Instead we have to allocate a new skb and if requested
    fill out the scm details.

    Fixes: 869e7c62486ec ("net: af_unix: implement stream sendpage support")
    Reported-by: Al Viro
    Cc: Al Viro
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

30 Nov, 2015

1 commit

  • Commit 9c7077622dd91 ("packet: make packet_snd fail on len smaller
    than l2 header") added validation for the packet size in packet_snd.
    This change enforces that every packet needs a header (with at least
    hard_header_len bytes) plus a payload with at least one byte. Before
    this change the payload was optional.

    This fixes PPPoE connections which do not have a "Service" or
    "Host-Uniq" configured (which is violating the spec, but is still
    widely used in real-world setups). Those are currently failing with the
    following message: "pppd: packet size is too short (24
    Signed-off-by: David S. Miller

    Martin Blumenstingl
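
The difference between "payload required" and "payload optional" in the commit above is a one-character change in a length comparison. The message is cut off in this listing, so the sketch below assumes the fix accepts header-only packets again; both function names are hypothetical and neither is the actual check in packet_snd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Check as described for commit 9c7077622dd91: a full link-layer header
 * plus at least one byte of payload. */
bool len_ok_payload_required(size_t len, size_t hard_header_len)
{
    return len > hard_header_len;
}

/* Assumed relaxed check: a packet consisting only of the link-layer
 * header is accepted, so the payload is optional again. */
bool len_ok_header_only(size_t len, size_t hard_header_len)
{
    return len >= hard_header_len;
}
```

The two predicates disagree only when `len == hard_header_len`, which is the header-only case the PPPoE setups described above depend on.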
     

25 Nov, 2015

6 commits

  • Sasha found a NULL pointer dereference in the RDS connection code when
    sending a message to an apparently unbound socket. The problem is caused
    by the code checking if the socket is bound in rds_sendmsg(), which checks
    the rs_bound_addr field without taking a lock on the socket. This opens a
    race where rs_bound_addr is temporarily set but where the transport is not
    yet set in rds_bind(), leading to a NULL pointer dereference when trying to
    dereference 'trans' in __rds_conn_create().

    Vegard wrote a reproducer for this issue, so kindly ask him to share if
    you're interested.

    I cannot reproduce the NULL pointer dereference using Vegard's reproducer
    with this patch, whereas I could without.

    Complete earlier incomplete fix to CVE-2015-6937:

    74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")

    Cc: David S. Miller
    Cc: stable@vger.kernel.org

    Reviewed-by: Vegard Nossum
    Reviewed-by: Sasha Levin
    Acked-by: Santosh Shilimkar
    Signed-off-by: Quentin Casasnovas
    Signed-off-by: David S. Miller

    Quentin Casasnovas
     
  • During pre-upstream development, the openvswitch datapath used a custom
    hashtable to store vports that could fail on delete due to lack of
    memory. However, prior to upstream submission, this code was reworked to
    use an hlist based hashtable with flexible-array based buckets. As such
    the failure condition was eliminated from the vport_del path, rendering
    this comment invalid.

    Signed-off-by: Aaron Conole
    Signed-off-by: David S. Miller

    Aaron Conole
     
  • Since (at least) commit b17a7c179dd3 ("[NET]: Do sysfs registration as
    part of register_netdevice."), netdev_run_todo() deals only with
    unregistration, so we don't need to do the rtnl_unlock/lock cycle to
    finish registration when failing pimreg or dvmrp device creation. In
    fact that opens a race condition where someone can delete the device
    while rtnl is unlocked because it's fully registered. The problem gets
    worse when netlink support is introduced as there are more points of entry
    that can cause it and it also makes reusing that code correctly impossible.

    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Normally, the transmit phase of a client call is implicitly ack'd by the
    reception of the first data packet of the response.
    However, if a security negotiation happens, the transmit phase, if it is
    entirely contained in a single packet, may get an ack packet in response
    and then may get aborted due to security negotiation failure.

    Because the client has shifted state to RXRPC_CALL_CLIENT_AWAIT_REPLY due
    to having transmitted all the data, the code that handles processing of the
    received ack packet doesn't note the hard ack of the data packet.

    The following abort packet in the case of security negotiation failure then
    incurs an assertion failure when it tries to drain the Tx queue because the
    hard ack state is out of sync (hard ack means the packets have been
    processed and can be discarded by the sender; a soft ack means that the
    packets are received but could still be discarded and rerequested by the
    receiver).

    To fix this, we should record the hard ack we received for the ack packet.

    The assertion failure looks like:

    RxRPC: Assertion failed
    1 ] [] rxrpc_rotate_tx_window+0xbc/0x131 [af_rxrpc]
    ...

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • If a fragmented multicast packet is received on an ethernet device which
    has an active macvlan on top of it, each fragment is duplicated and
    received both on the underlying device and the macvlan. If some
    fragments for macvlan are processed before the whole packet for the
    underlying device is reassembled, the "overlapping fragments" test in
    ip6_frag_queue() discards the whole fragment queue.

    To resolve this, add the device ifindex to the search key and require it
    to match when reassembling multicast packets and packets to link-local
    addresses.

    Note: a similar patch has already been submitted by Yoshifuji Hideaki in

    http://patchwork.ozlabs.org/patch/220979/

    but got lost and forgotten for some reason.

    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Michal Kubeček
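
The key-matching rule in the commit above can be sketched in user space with the standard IPv6 address macros. This is an illustration, not the kernel's fragment queue code: `struct frag_key`, `frag_key_init()` and `frag_key_match()` are hypothetical names, and the key holds only the fields the message mentions plus the fragment id and addresses.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct frag_key {
    uint32_t id;         /* fragment identification */
    struct in6_addr src;
    struct in6_addr dst;
    int iif;             /* ifindex of the receiving device */
};

/* Build a key from textual addresses; false if either fails to parse. */
bool frag_key_init(struct frag_key *k, uint32_t id,
                   const char *src, const char *dst, int iif)
{
    memset(k, 0, sizeof(*k));
    k->id = id;
    k->iif = iif;
    return inet_pton(AF_INET6, src, &k->src) == 1 &&
           inet_pton(AF_INET6, dst, &k->dst) == 1;
}

/* A packet joins a queue if id and addresses match; for multicast or
 * link-local destinations the receiving device must match as well. */
bool frag_key_match(const struct frag_key *queue, const struct frag_key *pkt)
{
    if (queue->id != pkt->id ||
        memcmp(&queue->src, &pkt->src, sizeof(pkt->src)) != 0 ||
        memcmp(&queue->dst, &pkt->dst, sizeof(pkt->dst)) != 0)
        return false;
    if (IN6_IS_ADDR_MULTICAST(&pkt->dst) || IN6_IS_ADDR_LINKLOCAL(&pkt->dst))
        return queue->iif == pkt->iif;
    return true;
}
```

With this rule, the copy of a multicast fragment seen on the macvlan and the copy seen on the underlying device land in separate queues, so neither trips the overlapping-fragments test for the other.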
     
  • Coverity says:

    *** CID 1338065: Error handling issues (CHECKED_RETURN)
    /net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
    156 struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
    157 struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
    158 struct sk_buff *clone;
    159 struct rtable *rt;
    160
    161 if (skb_headroom(skb) < UDP_MIN_HEADROOM)
    >>> CID 1338065: Error handling issues (CHECKED_RETURN)
    >>> Calling "pskb_expand_head" without checking return value (as is done elsewhere 51 out of 56 times).
    162 pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
    163
    164 clone = skb_clone(skb, GFP_ATOMIC);
    165 skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
    166 ub = rcu_dereference_rtnl(b->media_ptr);
    167 if (!ub) {

    When expanding buffer headroom over a udp tunnel with pskb_expand_head(),
    we unfortunately don't check its return value. As a result, if
    the function returns an error code due to lack of memory, the
    consequences are unpredictable because we unconditionally assume that
    it always succeeds.

    Fixes: e53567948f82 ("tipc: conditionally expand buffer headroom over udp tunnel")
    Reported-by:
    Cc: Stephen Hemminger
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
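
The pattern behind the fix above is to stop and propagate the error when the headroom expansion fails. A user-space sketch under stated assumptions: `struct buf_sketch`, `expand_head()` and `send_with_headroom()` are hypothetical, `malloc()` stands in for the kernel allocator, and `expand_head()` does not share the signature or semantics of `pskb_expand_head()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct buf_sketch {
    unsigned char *head; /* start of the allocation */
    size_t headroom;     /* free bytes before the data */
    size_t len;          /* bytes of data after the headroom */
};

/* Grow the headroom by extra bytes. Returns 0 on success or -ENOMEM,
 * leaving the buffer unchanged on failure. */
int expand_head(struct buf_sketch *b, size_t extra)
{
    unsigned char *n = malloc(b->headroom + extra + b->len);

    if (n == NULL)
        return -ENOMEM;
    memcpy(n + b->headroom + extra, b->head + b->headroom, b->len);
    free(b->head);
    b->head = n;
    b->headroom += extra;
    return 0;
}

/* Ensure at least min_headroom bytes before continuing; on failure the
 * error is returned instead of assuming the expansion succeeded. */
int send_with_headroom(struct buf_sketch *b, size_t min_headroom)
{
    if (b->headroom < min_headroom) {
        int err = expand_head(b, min_headroom - b->headroom);

        if (err)
            return err;
    }
    /* ... the headers would be pushed into the headroom here ... */
    return 0;
}
```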
     

24 Nov, 2015

5 commits

  • Even if we drain the receive queue thoroughly in tipc_release() after the
    tipc socket is removed from the rhashtable, it is possible that some
    packets are still in flight because some CPU runs the receiver and did
    its rhashtable lookup before we removed the socket. They will reach the
    receive queue, but nobody will delete them. To avoid this leak, we
    register a private socket destructor to purge the receive queue, meaning
    that releasing packets pending on the receive queue is delayed until the
    last reference to the tipc socket is released.

    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
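
A minimal single-threaded sketch of the idea, assuming a plain integer reference count and a singly linked receive queue. All `demo_` names are hypothetical and the real socket code uses proper locking and atomic reference counts; the point is only that the purge runs from the destructor, when the last reference goes away.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_pkt { struct demo_pkt *next; };

struct demo_sock {
    int refcnt;                               /* simplified, non-atomic */
    struct demo_pkt *rcv_head;                /* pending receive queue */
    void (*destruct)(struct demo_sock *sk);   /* private destructor */
};

static int demo_pkts_freed;

/* Free every packet still sitting on the receive queue. */
static void demo_purge(struct demo_sock *sk)
{
    while (sk->rcv_head) {
        struct demo_pkt *p = sk->rcv_head;

        sk->rcv_head = p->next;
        free(p);
        demo_pkts_freed++;
    }
}

/* Destructor: runs once, after the last reference is dropped. */
static void demo_destruct(struct demo_sock *sk)
{
    demo_purge(sk);
}

static void demo_hold(struct demo_sock *sk) { sk->refcnt++; }

/* Returns 1 if this call released the socket. */
static int demo_put(struct demo_sock *sk)
{
    if (--sk->refcnt)
        return 0;
    if (sk->destruct)
        sk->destruct(sk);
    free(sk);
    return 1;
}

/* A receiver that looked the socket up earlier can still enqueue. */
static void demo_enqueue(struct demo_sock *sk)
{
    struct demo_pkt *p = malloc(sizeof(*p));

    if (!p)
        return;
    p->next = sk->rcv_head;
    sk->rcv_head = p;
}
```

A packet enqueued after an early drain is still freed, because the destructor runs only when the in-flight receiver has dropped its reference as well.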
     
  • A truncated cb_compound request will cause the client to decode null or
    data from a previous callback for nfs4.1 backchannel case, or uninitialized
    data for the nfs4.0 case. This is because the path through
    svc_process_common() advances the request's iov_base and decrements iov_len
    without adjusting the overall xdr_buf's len field. That causes
    xdr_init_decode() to set up the xdr_stream with an incorrect length in
    nfs4_callback_compound().

    Fixing this for the nfs4.1 backchannel case first requires setting the
    correct iov_len and page_len based on the length of received data in the
    same manner as the nfs4.0 case.

    Then the request's xdr_buf length can be adjusted for both cases based upon
    the remaining iov_len and page_len.

    Signed-off-by: Benjamin Coddington
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust

    Benjamin Coddington
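
The length mismatch can be illustrated with a reduced buffer model. `struct demo_xdr_buf` below is a hypothetical simplification with only a head segment and a page length (no tail, no page array), so it is a sketch of the bookkeeping rather than the SUNRPC structure itself.

```c
#include <assert.h>
#include <stddef.h>

struct demo_iov {
    const unsigned char *base;
    size_t len;
};

/* Simplified buffer: head segment plus page data, and a total length. */
struct demo_xdr_buf {
    struct demo_iov head;
    size_t page_len;
    size_t len;  /* must equal head.len + page_len for a correct decode */
};

/* Header processing advances the head but leaves the total stale. */
static void demo_consume_header(struct demo_xdr_buf *b, size_t n)
{
    if (n > b->head.len)
        n = b->head.len;
    b->head.base += n;
    b->head.len -= n;
}

/* Recompute the total from what actually remains. */
static void demo_fix_len(struct demo_xdr_buf *b)
{
    b->len = b->head.len + b->page_len;
}
```

Without the recomputation, a decoder set up from `len` would believe more bytes are available than the segments actually hold.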
     
  • WARN_ON_ONCE() takes a condition; it doesn't take an error message. I
    have converted this to WARN() instead.

    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
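
The difference between the two macro shapes can be sketched with userspace stand-ins. `DEMO_WARN_ON()` and `DEMO_WARN()` are hypothetical simplifications (no "once" tracking, no format arguments, no stack dump) and are not the kernel definitions.

```c
#include <assert.h>
#include <stdio.h>

static int demo_warnings_emitted;

/* Shared helper: reports when cond is true and returns cond. */
static int demo_warn_helper(int cond, const char *msg)
{
    if (!cond)
        return 0;
    demo_warnings_emitted++;
    fprintf(stderr, "WARNING: %s\n", msg ? msg : "(no message)");
    return 1;
}

/* Condition only: anything passed here is evaluated as a truth value. */
#define DEMO_WARN_ON(cond)    demo_warn_helper(!!(cond), NULL)

/* Condition plus message: the form to use when text should be printed. */
#define DEMO_WARN(cond, msg)  demo_warn_helper(!!(cond), (msg))
```

Passing a string literal to the condition-only form would compile, but the non-NULL pointer would always evaluate as true and the text would never be printed.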
     
  • Rainer Weikusat writes:
    An AF_UNIX datagram socket being the client in an n:1 association with
    some server socket is only allowed to send messages to the server if the
    receive queue of this socket contains at most sk_max_ack_backlog
    datagrams. This implies that prospective writers might be forced to go
    to sleep even though none of the messages presently enqueued on the
    server receive queue were sent by them. In order to ensure that these
    will be woken up once space becomes available again, the present
    unix_dgram_poll
    routine does a second sock_poll_wait call with the peer_wait wait queue
    of the server socket as queue argument (unix_dgram_recvmsg does a wake
    up on this queue after a datagram was received). This is inherently
    problematic because the server socket is only guaranteed to remain alive
    for as long as the client still holds a reference to it. In case the
    connection is dissolved via connect or by the dead peer detection logic
    in unix_dgram_sendmsg, the server socket may be freed even though "the
    polling mechanism" (in particular, epoll) still has a pointer to the
    corresponding peer_wait queue. There's no way to forcibly deregister a
    wait queue with epoll.

    Based on an idea by Jason Baron, the patch below changes the code such
    that a wait_queue_t belonging to the client socket is enqueued on the
    peer_wait queue of the server whenever the peer receive queue full
    condition is detected by either a sendmsg or a poll. A wake up on the
    peer queue is then relayed to the ordinary wait queue of the client
    socket via a wake function. The connection to the peer wait queue is again
    dissolved if either a wake up is about to be relayed or the client
    socket reconnects or a dead peer is detected or the client socket is
    itself closed. This enables removing the second sock_poll_wait from
    unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
    that no blocked writer sleeps forever.

    Signed-off-by: Rainer Weikusat
    Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
    Reviewed-by: Jason Baron
    Signed-off-by: David S. Miller

    Rainer Weikusat
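
The relay scheme can be sketched as a single-threaded toy model. Everything here is hypothetical (`demo_usock`, `demo_hook()`, `demo_relay()` and so on), there is no locking, and a counter stands in for waking the client's own waiters; it only shows the hook, relay and unhook steps.

```c
#include <assert.h>
#include <stddef.h>

struct demo_wait_entry {
    struct demo_wait_entry *next;
    void (*func)(struct demo_wait_entry *we);
};

struct demo_waitq { struct demo_wait_entry *head; };

struct demo_usock {
    struct demo_waitq peer_wait;       /* writers waiting on our queue */
    struct demo_wait_entry peer_wake;  /* our hook on a peer's peer_wait */
    struct demo_usock *hooked_peer;    /* peer we are hooked to, or NULL */
    int wakeups;                       /* stands in for our own wait queue */
};

/* Dissolve the connection to the peer's wait queue, if any. */
static void demo_unhook(struct demo_usock *client)
{
    struct demo_wait_entry **pp;

    if (!client->hooked_peer)
        return;
    for (pp = &client->hooked_peer->peer_wait.head; *pp; pp = &(*pp)->next) {
        if (*pp == &client->peer_wake) {
            *pp = client->peer_wake.next;
            break;
        }
    }
    client->peer_wake.next = NULL;
    client->hooked_peer = NULL;
}

/* Wake function: unhook first, then relay to the client's own waiters. */
static void demo_relay(struct demo_wait_entry *we)
{
    struct demo_usock *client = (struct demo_usock *)
        ((char *)we - offsetof(struct demo_usock, peer_wake));

    demo_unhook(client);
    client->wakeups++;
}

/* Called when the peer's receive queue is found to be full. */
static void demo_hook(struct demo_usock *client, struct demo_usock *peer)
{
    if (client->hooked_peer)
        return;
    client->peer_wake.func = demo_relay;
    client->peer_wake.next = peer->peer_wait.head;
    peer->peer_wait.head = &client->peer_wake;
    client->hooked_peer = peer;
}

/* Wake everyone on a queue; entries may unhook themselves. */
static void demo_wake_all(struct demo_waitq *q)
{
    struct demo_wait_entry *we = q->head, *next;

    while (we) {
        next = we->next;
        we->func(we);
        we = next;
    }
}
```

Because the entry on the peer's queue belongs to the client and is removed on relay, reconnect or close, nothing outside the client keeps a pointer into the peer once the association ends.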
     
  • The classid of a process is changed either when a process is moved to
    or from a cgroup or when the net_cls.classid file is updated.
    Previously net_cls only supported propagating these changes to the
    cgroup's related sockets when a process was added to or removed from
    the cgroup. This means it was necessary to remove and re-add all
    processes to a cgroup in order to update its classid. This change
    introduces support for doing this dynamically - i.e. when the value is
    changed in the net_cls.classid file, this will also trigger an update
    to the classid associated with all sockets controlled by the cgroup.
    This mimics the behaviour of other cgroup subsystems.
    net_prio circumvents this issue by storing an index into a table with
    each socket (and so any updates to the table don't require updating
    the value associated with the socket). net_cls, however, passes the
    socket the classid directly, and so this additional step is needed.

    Signed-off-by: Nina Schiff
    Acked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Nina Schiff
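
The contrast between the two storage schemes can be sketched as follows. The `demo_` names, the fixed-size table and the flat socket array are assumptions made for illustration and do not mirror the cgroup code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_GROUPS 4

/* net_prio-like: one shared table, sockets keep only an index into it. */
static unsigned int demo_prio_table[DEMO_MAX_GROUPS];

struct demo_sock {
    unsigned int classid;  /* net_cls-like: value copied into the socket */
    unsigned int group;    /* index of the owning group */
};

/* Indexed lookup: a table update is visible without touching sockets. */
static unsigned int demo_sock_priority(const struct demo_sock *sk)
{
    return demo_prio_table[sk->group];
}

/* Copied value: every socket of the group must be rewritten on update. */
static size_t demo_update_classid(struct demo_sock *socks, size_t n,
                                  unsigned int group, unsigned int classid)
{
    size_t i, updated = 0;

    for (i = 0; i < n; i++) {
        if (socks[i].group == group) {
            socks[i].classid = classid;
            updated++;
        }
    }
    return updated;
}
```

The explicit walk over the group's sockets corresponds to the additional step the message describes for the copied classid.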
     

23 Nov, 2015

2 commits

  • Similar to ipv4, when destroying an mrt table the static mfc entries and
    the static devices are kept, which leads to devices that can never be
    destroyed (because of refcnt taken) and leaked memory. Make sure that
    everything is cleaned up on netns destruction.

    Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
    CC: Benjamin Thery
    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • When destroying an mrt table the static mfc entries and the static
    devices are kept, which leads to devices that can never be destroyed
    (because of refcnt taken) and leaked memory, for example:
    unreferenced object 0xffff880034c144c0 (size 192):
    comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
    hex dump (first 32 bytes):
    98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff .S.4.....S.4....
    ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmem_cache_alloc+0x190/0x300
    [] ip_mroute_setsockopt+0x5cb/0x910
    [] do_ip_setsockopt.isra.11+0x105/0xff0
    [] ip_setsockopt+0x30/0xa0
    [] raw_setsockopt+0x33/0x90
    [] sock_common_setsockopt+0x14/0x20
    [] SyS_setsockopt+0x71/0xc0
    [] entry_SYSCALL_64_fastpath+0x16/0x7a
    [] 0xffffffffffffffff

    Make sure that everything is cleaned on netns destruction.

    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov