01 Aug, 2015

2 commits

  • Pull networking fixes from David Miller:

    1) Must teardown SR-IOV before unregistering netdev in igb driver, from
    Alex Williamson.

    2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell.

    3) Default route selection in ipv4 should take the prefix length, table
    ID, and TOS into account, from Julian Anastasov.

    4) sch_plug must have a reset method in order to purge all buffered
    packets when the qdisc is reset, likewise for sch_choke, from WANG
    Cong.

    5) Fix deadlock and races in slave_changelink/br_setport in bridging.
    From Nikolay Aleksandrov.

    6) mlx4 bug fixes (wrong index in port event propagation to VFs,
    overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack
    Morgenstein, and Or Gerlitz.

    7) Turn off klog message about SCTP userspace interface compat that
    makes no sense at all, from Daniel Borkmann.

    8) Fix unbounded restarts of inet frag eviction process, causing NMI
    watchdog soft lockup messages, from Florian Westphal.

    9) Suspend/resume fixes for r8152 from Hayes Wang.

    10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from
    Sabrina Dubroca.

    11) Fix performance regression when removing a lot of routes from the
    ipv4 routing tables, from Alexander Duyck.

    12) Fix device leak in AF_PACKET, from Lars Westerhoff.

    13) AF_PACKET also has a header length comparison bug due to signedness,
    from Alexander Drozdov.

    14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann.

    15) Memory leaks, TSO stats, watchdog timeout and other fixes to
    thunderx driver from Sunil Goutham and Thanneeru Srinivasulu.

    16) act_bpf can leak memory when replacing programs, from Daniel
    Borkmann.

    17) WOL packet fixes in gianfar driver, from Claudiu Manoil.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
    stmmac: fix missing MODULE_LICENSE in stmmac_platform
    gianfar: Enable device wakeup when appropriate
    gianfar: Fix suspend/resume for wol magic packet
    gianfar: Fix warning when CONFIG_PM off
    act_pedit: check binding before calling tcf_hash_release()
    net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
    net: sched: fix refcount imbalance in actions
    r8152: reset device when tx timeout
    r8152: add pre_reset and post_reset
    qlcnic: Fix corruption while copying
    act_bpf: fix memory leaks when replacing bpf programs
    net: thunderx: Fix for crash while BGX teardown
    net: thunderx: Add PCI driver shutdown routine
    net: thunderx: Fix crash when changing rss with mutliple traffic flows
    net: thunderx: Set watchdog timeout value
    net: thunderx: Wakeup TXQ only if CQE_TX are processed
    net: thunderx: Suppress alloc_pages() failure warnings
    net: thunderx: Fix TSO packet statistic
    net: thunderx: Fix memory leak when changing queue count
    net: thunderx: Fix RQ_DROP miscalculation
    ...

    Linus Torvalds
     
  • When we share an action within a filter, the bind refcnt
    should increase, therefore we should not call tcf_hash_release().

    Fixes: 1a29321ed045 ("net_sched: act: Dont increment refcnt on replace")
    Cc: Jamal Hadi Salim
    Cc: Daniel Borkmann
    Signed-off-by: Cong Wang
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    WANG Cong
     

31 Jul, 2015

2 commits

  • The newsk returned by sk_clone_lock should hold a get_net()
    reference if, and only if, the parent is not a kernel socket
    (making this similar to sk_alloc()).

    E.g., for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
    sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
    socket is a kernel socket (defined in sk_alloc() as having
    sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
    and should not hold a get_net() reference.
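The rule above can be sketched as a small user-space model. All names here (struct toy_net, struct toy_sock, toy_clone) are hypothetical and only mirror the sk_net_refcnt idea described in the message; this is not the kernel code.

```c
#include <stdbool.h>

/* Toy stand-in for a network namespace with a reference counter. */
struct toy_net {
    int refs;
};

/* Toy stand-in for a socket; net_refcnt mirrors the role of sk_net_refcnt. */
struct toy_sock {
    struct toy_net *net;
    bool net_refcnt;   /* false models a kernel socket */
};

/* Clone a socket: take a namespace reference only if the parent holds one. */
static struct toy_sock toy_clone(const struct toy_sock *parent)
{
    struct toy_sock child = *parent;

    if (child.net_refcnt)
        child.net->refs++;   /* models get_net() */
    return child;
}
```

In this model, cloning a kernel socket leaves the namespace counter untouched, while cloning a user socket increments it once.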

    Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
    netns of kernel sockets.")
    Acked-by: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Since commit 55334a5db5cd ("net_sched: act: refuse to remove bound action
    outside"), we end up with a wrong reference count for a tc action.

    Test case 1:

    FOO="1,6 0 0 4294967295,"
    BAR="1,6 0 0 4294967294,"
    tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
    action bpf bytecode "$FOO"
    tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 1 bind 1
    tc actions replace action bpf bytecode "$BAR" index 1
    tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
    index 1 ref 2 bind 1
    tc actions replace action bpf bytecode "$FOO" index 1
    tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 3 bind 1

    Test case 2:

    FOO="1,6 0 0 4294967295,"
    tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
    tc actions show action gact
    action order 0: gact action pass
    random type none pass val 0
    index 1 ref 1 bind 1
    tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
    tc actions show action gact
    action order 0: gact action pass
    random type none pass val 0
    index 1 ref 2 bind 1
    tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
    tc actions show action gact
    action order 0: gact action pass
    random type none pass val 0
    index 1 ref 3 bind 1

    What happens is that in tcf_hash_check(), we check tcf_common for a given
    index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
    found an existing action. Now there are the following cases:

    1) We do a late binding of an action. In that case, we leave the
    tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
    handler. This is correctly handled.

    2) We replace the given action, or we try to add one without replacing
    and find out that the action at a specific index already exists
    (thus, we go out with error in that case).

    In case of 2), we have to undo the reference count increase from
    tcf_hash_check() in the tcf_hash_release() function. Currently, we fail to
    do so because of the 'tcfc_bindcnt > 0' check which bails out early with
    an -EPERM error.

    Now, while commit 55334a5db5cd prevents 'tc actions del action ...' on an
    already classifier-bound action from dropping the reference count (which
    could then become negative, wrap around, etc.), this restriction only
    accounts for invocations outside a specific action's ->init() handler.

    One possible solution would be to add a flag so that we trigger
    the -EPERM only in situations where it is indeed relevant.

    After the patch, above test cases have correct reference count again.
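The flag idea can be illustrated with a simplified user-space model. The struct, the function name toy_release and the "strict" parameter are assumptions made for this sketch; the message does not show the actual patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy counterpart of the two counters discussed above. */
struct toy_action {
    int refcnt;    /* models tcfc_refcnt */
    int bindcnt;   /* models tcfc_bindcnt */
};

/*
 * Drop one reference. With strict set, refuse to touch an action that is
 * still bound to a classifier; with strict clear, always undo the increase.
 * Returns 1 if the action would be destroyed, 0 if it stays alive,
 * and -EPERM if the request was refused.
 */
static int toy_release(struct toy_action *a, bool bind, bool strict)
{
    if (bind)
        a->bindcnt--;
    else if (strict && a->bindcnt > 0)
        return -EPERM;

    a->refcnt--;
    return (a->bindcnt <= 0 && a->refcnt <= 0) ? 1 : 0;
}
```

A caller on the replace or error path would pass strict as false so the earlier increase is always undone, while an external delete request would pass true.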

    Fixes: 55334a5db5cd ("net_sched: act: refuse to remove bound action outside")
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Jul, 2015

5 commits

  • We currently trigger multiple memory leaks when replacing bpf
    actions, among others:

    comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
    hex dump (first 32 bytes):
    01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................
    18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00 ...m............
    backtrace:
    kmemleak_alloc+0x4e/0xb0
    __vmalloc_node_range+0x1bd/0x2c0
    __vmalloc+0x4a/0x50
    bpf_prog_alloc+0x3a/0xa0
    bpf_prog_create+0x44/0xa0
    tcf_bpf_init+0x28b/0x3c0 [act_bpf]
    tcf_action_init_1+0x191/0x1b0
    tcf_action_init+0x82/0xf0
    tcf_exts_validate+0xb2/0xc0
    cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
    cls_bpf_change+0x1a6/0x274 [cls_bpf]
    tc_ctl_tfilter+0x335/0x910
    rtnetlink_rcv_msg+0x95/0x240
    netlink_rcv_skb+0xaf/0xc0
    rtnetlink_rcv+0x2e/0x40
    netlink_unicast+0xef/0x1b0

    The issue is that the old content of tcf_bpf is allocated and needs
    to be released when we replace it. We seem to have leaked the filter
    and insns since the beginning of act_bpf, and later on the name as
    well.
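The release-on-replace pattern can be sketched in user space as follows. The struct layout, the helper names and the allocation counter are all hypothetical and exist only to make the leak visible; the real act_bpf code is not shown in the message.

```c
#include <stdlib.h>
#include <string.h>

/* Toy action owning two heap buffers, loosely modelled on the text above. */
struct toy_bpf {
    unsigned int *insns;
    char *name;
};

static int live_allocs;   /* crude leak counter for the demo */

static void *counted_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        live_allocs++;
    return p;
}

static void counted_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/* Replace the program; the previous buffers are released before the swap. */
static int toy_replace(struct toy_bpf *act, const unsigned int *insns,
                       size_t n, const char *name)
{
    unsigned int *new_insns = counted_alloc(n * sizeof(*insns));
    char *new_name = counted_alloc(strlen(name) + 1);

    if (!new_insns || !new_name) {
        counted_free(new_insns);
        counted_free(new_name);
        return -1;
    }
    memcpy(new_insns, insns, n * sizeof(*insns));
    strcpy(new_name, name);

    counted_free(act->insns);   /* without these two calls the old buffers leak */
    counted_free(act->name);
    act->insns = new_insns;
    act->name = new_name;
    return 0;
}

/* Release everything the action still owns. */
static void toy_destroy(struct toy_bpf *act)
{
    counted_free(act->insns);
    counted_free(act->name);
    act->insns = NULL;
    act->name = NULL;
}
```

With the two counted_free() calls in toy_replace(), the number of live allocations stays constant across repeated replacements.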

    Example test case, after patch:

    # FOO="1,6 0 0 4294967295,"
    # BAR="1,6 0 0 4294967294,"
    # tc actions add action bpf bytecode "$FOO" index 2
    # tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 2 ref 1 bind 0
    # tc actions replace action bpf bytecode "$BAR" index 2
    # tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
    index 2 ref 1 bind 0
    # tc actions replace action bpf bytecode "$FOO" index 2
    # tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 2 ref 1 bind 0
    # tc actions del action bpf index 2
    [...]
    # echo "scan" > /sys/kernel/debug/kmemleak
    # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
    0

    Fixes: d23b8ad8ab23 ("tc: add BPF based action")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch is the IPv6 equivalent of commit
    6c8b4e3ff81b ("arp: flush arp cache on IFF_NOARP change")

    Without it, we keep buggy neighbours in the cache, with destination
    MAC address equal to our own MAC address.

    Tested:
    tcpdump -i eth0 -s 0 ip6 -n -e &
    ip link set dev eth0 arp off
    ping6 remote // sends buggy frames
    ip link set dev eth0 arp on
    ping6 remote // should work once kernel is patched

    Signed-off-by: Eric Dumazet
    Reported-by: Mario Fanelli
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since mdb states were introduced, when deleting an entry the state was
    left as it was set in the delete request from the user, which leads to
    the following output when doing a monitor (for example):
    $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
    (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
    $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
    (monitor) dev br0 port eth3 grp 239.0.0.1 temp
    ^^^
    Note the "temp" state in the delete notification, which is wrong since
    the entry was permanent. The state in a delete is always reported as
    "temp" regardless of the real state of the entry.

    After this patch:
    $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
    (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
    $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
    (monitor) dev br0 port eth3 grp 239.0.0.1 permanent

    One important note: the state is actually not matched when doing a
    delete, so one can delete a permanent entry by stating "temp" at the
    end of the command. I've chosen this fix in order not to break
    user-space tools which rely on this (incorrect) behaviour.

    So to give an example after this patch and using the wrong state:
    $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
    (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
    $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
    (monitor) dev br0 port eth3 grp 239.0.0.1 permanent

    Note the state of the entry that got deleted is correct in the
    notification.
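The behaviour described above (state ignored for matching, real state reported in the notification) can be modelled in a few lines. The table layout and the names toy_mdb_entry and toy_mdb_del are assumptions for illustration and are not the bridge code.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_TEMP = 0, TOY_PERMANENT = 1 };

struct toy_mdb_entry {
    unsigned int group;      /* e.g. an IPv4 group address as an integer */
    enum toy_state state;
    bool present;
};

/*
 * Delete the entry for 'group'. The requested state is not used for
 * matching, and the notification carries the state stored in the entry.
 * Returns true and fills *notified when an entry was deleted.
 */
static bool toy_mdb_del(struct toy_mdb_entry *tbl, size_t n, unsigned int group,
                        enum toy_state requested, enum toy_state *notified)
{
    (void)requested;   /* deliberately ignored, as the message explains */

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].present && tbl[i].group == group) {
            *notified = tbl[i].state;   /* report the real state */
            tbl[i].present = false;
            return true;
        }
    }
    return false;
}
```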

    Signed-off-by: Nikolay Aleksandrov
    Fixes: ccb1c31a7a87 ("bridge: add flags to distinguish permanent mdb entires")
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • When fast leave is configured on a bridge port and an IGMP leave is
    received for a group, the group is not deleted immediately if there is
    a router detected or if multicast querier is configured.
    Ideally the group should be deleted immediately when fast leave is
    configured.

    Signed-off-by: Satish Ashok
    Signed-off-by: David S. Miller

    Satish Ashok
     
  • There are several devices that can receive vlan tagged packets with
    CHECKSUM_PARTIAL like tap, possibly veth and xennet.
    When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
    by bridge to a device with the IP_CSUM feature, they end up with checksum
    error because before entering bridge, the network header is set to
    ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
    get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.

    Since the network header is expected to be ETH_HLEN in flow-dissection
    and hash-calculation in RPS in rx path, and since the header pointer fix
    is needed only in tx path, set the appropriate network header on forwarding
    packets.
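To make the offset problem concrete, here is a minimal sketch that walks stacked VLAN tags in a raw Ethernet frame and returns where the network header really starts. The TOY_ constants and the function name are chosen for this example; it assumes the usual 14-byte Ethernet header, 4-byte VLAN tags and the 0x8100 / 0x88A8 tag EtherTypes, and it is not the bridge code itself.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_HLEN   14       /* dst MAC + src MAC + EtherType */
#define TOY_VLAN_HLEN  4        /* tag control info + inner EtherType */
#define TOY_P_8021Q    0x8100
#define TOY_P_8021AD   0x88A8

/*
 * Return the offset of the network header, skipping any stacked VLAN tags.
 * Returns 0 if the frame is too short to hold the headers it announces.
 */
static size_t toy_network_offset(const uint8_t *frame, size_t len)
{
    size_t off = TOY_ETH_HLEN;

    if (len < TOY_ETH_HLEN)
        return 0;
    for (;;) {
        /* The EtherType always sits in the two bytes just before 'off'. */
        uint16_t proto = (uint16_t)((frame[off - 2] << 8) | frame[off - 1]);

        if (proto != TOY_P_8021Q && proto != TOY_P_8021AD)
            return off;
        if (len < off + TOY_VLAN_HLEN)
            return 0;
        off += TOY_VLAN_HLEN;
    }
}
```

For an untagged frame this yields 14; each VLAN tag adds 4, which is the difference a checksum-offloading device would otherwise miss.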

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     

29 Jul, 2015

5 commits

  • tpacket_fill_skb() can return a negative value (-errno) which
    is stored in tp_len variable. In that case the following
    condition will be (but shouldn't be) true:

    tp_len > dev->mtu + dev->hard_header_len

    as dev->mtu and dev->hard_header_len are both unsigned.

    That may lead to just returning an incorrect EMSGSIZE errno
    to the user.
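The signedness pitfall can be reproduced outside the kernel. In this sketch the parameter types (int, unsigned int, unsigned short) are assumptions standing in for tp_len, dev->mtu and dev->hard_header_len, and the "checked" variant is only one possible way to guard the comparison, not necessarily what the patch does.

```c
#include <stdbool.h>

/*
 * Buggy form: tp_len is implicitly converted to unsigned before the
 * comparison, so a negative error code looks like a huge length.
 */
static bool toy_too_long_buggy(int tp_len, unsigned int mtu,
                               unsigned short hard_header_len)
{
    return tp_len > mtu + hard_header_len;
}

/* One way to avoid it: rule out negative values before comparing lengths. */
static bool toy_too_long_checked(int tp_len, unsigned int mtu,
                                 unsigned short hard_header_len)
{
    if (tp_len < 0)
        return false;   /* an error code, not a length */
    return (unsigned int)tp_len > mtu + hard_header_len;
}
```

With tp_len set to a negative errno value, the buggy variant reports "too long" while the checked variant does not, so the caller can propagate the original error instead of EMSGSIZE.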

    Fixes: 52f1454f629fa ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
    Signed-off-by: Alexander Drozdov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexander Drozdov
     
  • When arp is off on a device, and ioctl(SIOCGARP) is queried,
    a buggy answer is given with MAC address of the device, instead
    of the mac address of the destination/gateway.

    We filter out NUD_NOARP neighbours for /proc/net/arp,
    we must do the same for SIOCGARP ioctl.
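A simplified model of the filtering: a lookup that treats entries flagged as NOARP as if they were absent. The table layout, the TOY_ flag values and toy_arp_get are hypothetical; the error code is chosen to match the "No such device or address" text in the test output, which corresponds to ENXIO.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_NUD_REACHABLE 0x02   /* illustrative flag values */
#define TOY_NUD_NOARP     0x40

struct toy_neigh {
    unsigned int ip;
    unsigned char state;
    unsigned char mac[6];
};

/* Look up 'ip' and copy its MAC; entries flagged NOARP are treated as absent. */
static int toy_arp_get(const struct toy_neigh *tbl, size_t n,
                       unsigned int ip, unsigned char mac[6])
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].ip != ip)
            continue;
        if (tbl[i].state & TOY_NUD_NOARP)
            break;                       /* do not report a bogus address */
        memcpy(mac, tbl[i].mac, 6);
        return 0;
    }
    return -ENXIO;
}
```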

    Tested:

    lpaa23:~# ./arp 10.246.7.190
    MAC=00:01:e8:22:cb:1d // correct answer

    lpaa23:~# ip link set dev eth0 arp off
    lpaa23:~# cat /proc/net/arp # check arp table is now 'empty'
    IP address HW type Flags HW address Mask Device
    lpaa23:~# ./arp 10.246.7.190
    MAC=00:1a:11:c3:0d:7f // buggy answer before patch (this is eth0 mac)

    After patch :

    lpaa23:~# ip link set dev eth0 arp off
    lpaa23:~# ./arp 10.246.7.190
    ioctl(SIOCGARP) failed: No such device or address

    Signed-off-by: Eric Dumazet
    Reported-by: Vytautas Valancius
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This notification causes the FIB to be updated, which is not needed
    because the address already exists, and more importantly it may undo
    intentional changes that were made to the FIB after the address was
    originally added. (As a point of comparison, when an address becomes
    deprecated because its preferred lifetime expired, a notification on
    this chain is not generated.)

    The motivation for this commit is fixing an incompatibility between
    DHCP clients which set and update the address lifetime according to
    the lease, and a commercial VPN client which replaces kernel routes
    in a way that outbound traffic is sent only through the tunnel (and
    disconnects if any further route changes are detected via netlink).

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
     
  • These should be handled only by the respective STP which is in control.
    They become problematic for devices with limited resources with many
    ports because the hold_timer is per port and fires each second and the
    hello timer fires each 2 seconds even though it's global. While in
    user-space STP mode these timers are completely unnecessary so it's better
    to keep them off.
    Also ensure that when the bridge is up these timers are started only when
    running with kernel STP.

    Signed-off-by: Satish Ashok
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix a situation where the client uses the wrong (zero) stateid.
    - Fix a memory leak in nfs_do_recoalesce

    Bugfixes:
    - Plug a memory leak when ->prepare_layoutcommit fails
    - Fix an Oops in the NFSv4 open code
    - Fix a backchannel deadlock
    - Fix a livelock in sunrpc when sendmsg fails due to low memory
    availability
    - Don't revalidate the mapping if both size and change attr are up to
    date
    - Ensure we don't miss a file extension when doing pNFS
    - Several fixes to handle NFSv4.1 sequence operation status bits
    correctly
    - Several pNFS layout return bugfixes"

    * tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
    nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
    nfs: plug memory leak when ->prepare_layoutcommit fails
    SUNRPC: Report TCP errors to the caller
    sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
    NFSv4.2: handle NFS-specific llseek errors
    NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
    NFS: Fix a memory leak in nfs_do_recoalesce
    NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
    NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
    NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
    NFS: Don't revalidate the mapping if both size and change attr are up to date
    NFSv4/pnfs: Ensure we don't miss a file extension
    NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
    SUNRPC: xprt_complete_bc_request must also decrement the free slot count
    SUNRPC: Fix a backchannel deadlock
    pNFS: Don't throw out valid layout segments
    pNFS: pnfs_roc_drain() fix a race with open
    pNFS: Fix races between return-on-close and layoutreturn.
    pNFS: pnfs_roc_drain should return 'true' when sleeping
    pNFS: Layoutreturn must invalidate all existing layout segments.
    ...

    Linus Torvalds
     

28 Jul, 2015

3 commits

  • When binding a PF_PACKET socket, the use count of the bound interface is
    always increased with dev_hold in dev_get_by_{index,name}. However,
    when rebound with the same protocol and device as in the previous bind
    the use count of the interface was not decreased. Ultimately, this
    caused the deletion of the interface to fail with the following message:

    unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

    This patch moves the dev_put out of the conditional part that was only
    executed when either the protocol or device changed on a bind.

    Fixes: 902fefb82ef7 ('packet: improve socket create/bind latency in some cases')
    Signed-off-by: Lars Westerhoff
    Signed-off-by: Dan Carpenter
    Reviewed-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Lars Westerhoff
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • It was reported that update_suffix was taking a long time on systems where
    a large number of leaves were attached to a single node. As it turns out
    fib_table_flush was calling update_suffix for each leaf that didn't have all
    of the aliases stripped from it. As a result, on this large node removing
    one leaf would result in us calling update_suffix for every other leaf on
    the node.

    The fix is to just remove the calls to leaf_pull_suffix since they are
    redundant as we already have a call in resize that will go through and
    update the suffix length for the node before we exit out of
    fib_table_flush or fib_table_flush_external.

    Reported-by: David Ahern
    Signed-off-by: Alexander Duyck
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Alexander Duyck
     

27 Jul, 2015

9 commits

  • The networking layer does not reliably report the distinction between
    a non-blocking write failing because:
    1/ the queue is too full already and
    2/ a memory allocation attempt failed.

    The distinction is important because in the first case it is
    appropriate to retry as soon as the socket reports that it is
    writable, and in the second case a small delay is required as the
    socket will most likely report as writable but kmalloc could still
    fail.

    sk_stream_wait_memory() exhibits this distinction nicely, setting
    'vm_wait' if a small wait is needed. However in the non-blocking case
    it always returns -EAGAIN no matter the cause of the failure. This
    -EAGAIN can get all the way to sunrpc.

    The sunrpc layer expects EAGAIN to indicate the first cause, and
    ENOBUFS to indicate the second. Various documentation suggests that
    this is not unreasonable, but does not guarantee the desired error
    codes.

    The result of getting -EAGAIN when -ENOBUFS is expected is that the
    send is tried again in a tight loop and soft lockups are reported.

    So: add tests after calls to xs_sendpages() to translate -EAGAIN into
    -ENOBUFS if the socket is writable. This cannot happen inside
    xs_sendpages() as the test for "is socket writable" is different
    between TCP and UDP.
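The translation step amounts to a small status fix-up. In this sketch the function name is hypothetical, and the "socket is writable" result is passed in as a boolean because, as noted above, that test differs between TCP and UDP.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * If the send reported -EAGAIN although the socket is writable, the cause
 * was presumably a failed allocation rather than a full queue, so report
 * -ENOBUFS and let the caller back off briefly instead of retrying at once.
 */
static int toy_fixup_send_status(int status, bool sock_writable)
{
    if (status == -EAGAIN && sock_writable)
        return -ENOBUFS;
    return status;
}
```

Any other status, including -EAGAIN on a socket that is not writable, passes through unchanged.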

    With this change, the tight loop retrying xs_sendpages() becomes a
    loop which only retries every 250ms, and so will not trigger a
    soft-lockup warning.

    It is possible that the write did fail because the queue was too full
    and by the time xs_sendpages() completed, the queue was writable
    again. In this case an extra 250ms delay is inserted that isn't
    really needed. This circumstance suggests a degree of congestion so a
    delay is not necessarily a bad thing, and it can only cause a single
    250ms delay, not a series of them.

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
     
  • Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
    with flags = MSG_WAITALL | MSG_PEEK.

    sk_wait_data waits for sk_receive_queue not empty, but in this case,
    the receive queue is not empty, but does not contain any skb that we
    can use.

    Add a "last skb seen on receive queue" argument to sk_wait_data, so
    that it sleeps until the receive queue has new skbs.
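The change in the wake-up condition can be shown with a toy queue. The types and function names below are made up for this sketch; it only contrasts "queue not empty" with "queue tail differs from the last skb already seen".

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_skb {
    struct toy_skb *next;
};

struct toy_queue {
    struct toy_skb *head;
    struct toy_skb *tail;
};

/* Append an skb to the tail of the queue. */
static void toy_enqueue(struct toy_queue *q, struct toy_skb *skb)
{
    skb->next = NULL;
    if (q->tail)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
}

/* Old condition: any skb on the queue ends the wait, even one already peeked. */
static bool toy_ready_old(const struct toy_queue *q)
{
    return q->head != NULL;
}

/* New condition: only a tail different from the last skb seen ends the wait. */
static bool toy_ready_new(const struct toy_queue *q, const struct toy_skb *last_seen)
{
    return q->tail != last_seen;
}
```

With one already-peeked skb queued, the old condition is immediately true (hence the busy loop), while the new one stays false until another skb arrives.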

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
    Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
    Reported-by: Enrico Scholz
    Reported-by: Dan Searle
    Signed-off-by: Sabrina Dubroca
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     
  • Johan Hedberg says:

    ====================
    pull request: bluetooth 2015-07-23

    Here's another one-patch pull request for 4.2 which targets a potential
    NULL pointer dereference in the LE Security Manager code that can be
    triggered by using older user space tools. The issue has been there
    since 4.0 so there's the appropriate "Cc: stable" in place.

    Let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags
    race conditions with the evictor and use a participation test for the
    evictor list. When we're at that point (after inet_frag_kill) in the
    timer, there are 2 possible cases:

    1. The evictor added the entry to its evictor list while the timer was
    waiting for the chainlock
    or
    2. The timer unchained the entry and the evictor won't see it

    In both cases we should be able to see list_evictor correctly due
    to the sync on the chainlock.

    Joint work with Florian Westphal.

    Tested-by: Frank Schreuder
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Frank reports 'NMI watchdog: BUG: soft lockup' errors when
    load is high. Instead of (potentially) unbounded restarts of the
    eviction process, just skip to the next entry.

    One caveat is that, when a netns is exiting, a timer may still be running
    by the time inet_evict_bucket returns.

    We use the frag memory accounting to wait for outstanding timers,
    so that when we free the percpu counter we can be sure no running
    timer will trip over it.

    Reported-and-tested-by: Frank Schreuder
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Followup patch will call it after inet_frag_queue was freed, so q->net
    doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).

    Tested-by: Frank Schreuder
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • commit 65ba1f1ec0eff ("inet: frags: fix a race between inet_evict_bucket
    and inet_frag_kill") describes the bug, but the fix doesn't work reliably.

    Problem is that ->flags member can be set on other cpu without chainlock
    being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED
    bit after we put the element on the evictor private list.

    We can crash when walking the 'private' evictor list since an element can
    be deleted from list underneath the evictor.

    Joint work with Nikolay Alexandrov.

    Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
    Reported-by: Johan Schuijt
    Tested-by: Frank Schreuder
    Signed-off-by: Nikolay Alexandrov
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from
    RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the
    (as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d5e53 ("net:
    sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1].

    Imho, it was not a good idea, and we should just revert that message
    for a couple of reasons:

    1) It's uapi and therefore set in stone forever.

    2) To be able to run on older and newer kernels, an SCTP application
    would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/
    SCTP_RCVINFO support, so that on older kernels, it can make use
    of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO.
    In my (limited) experience, a lot of SCTP appliances are migrating
    to newer kernels only ve(ee)ry slowly.

    3) Some people don't have the chance to change their applications,
    f.e. due to proprietary legacy stuff. So, they'll hit this warning
    in fast path and are stuck with older kernels.

    But in particular due to point 1), I really fail to see the benefit of
    a warning. So just revert that for now; the issue was reported by Jamal.

    [1] http://thread.gmane.org/gmane.linux.network/321960/

    Reported-by: Jamal Hadi Salim
    Signed-off-by: Daniel Borkmann
    Cc: Michael Tuexen
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Since slave_changelink support was added there have been a few race
    conditions when using br_setport() since some of the port functions it
    uses require the bridge lock. It is very easy to trigger a lockup due to
    some internal spin_lock() usage without bh disabled, also it's possible to
    get the bridge into an inconsistent state.

    Signed-off-by: Nikolay Aleksandrov
    Fixes: 3ac636b8591c ("bridge: implement rtnl_link_ops->slave_changelink")
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

25 Jul, 2015

6 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS fixes for net

    The following patchset contains ten Netfilter/IPVS fixes, they are:

    1) Address refcount leak when creating an expectation from the ctnetlink
    interface.

    2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry
    Torokhov.

    3) Resolve panic for unreachable route in IPVS with locally generated
    traffic in the output path, from Alex Gartrell.

    4) Fix wrong source address in rare cases for tunneled traffic in IPVS,
    from Julian Anastasov.

    5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian.

    6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from
    Alex Gartrell.

    7) Fix crash with IPVS sync protocol v0 and FTP, from Julian.

    8) Reset sender cpu for forwarded traffic in IPVS, also from Julian.

    9) Allocate template conntracks through kmalloc() to resolve netns dependency
    problems with the conntrack kmem_cache.

    10) Fix zones with expectations that clash using the same tuple, from Joe
    Stringer.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In dev_queue_xmit(), net_cls is protected with rcu-bh.

    [ 270.730026] ===============================
    [ 270.730029] [ INFO: suspicious RCU usage. ]
    [ 270.730033] 4.2.0-rc3+ #2 Not tainted
    [ 270.730036] -------------------------------
    [ 270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
    [ 270.730041] other info that might help us debug this:
    [ 270.730043] rcu_scheduler_active = 1, debug_locks = 1
    [ 270.730045] 2 locks held by dhclient/748:
    [ 270.730046] #0: (rcu_read_lock_bh){......}, at: __dev_queue_xmit+0x50/0x960
    [ 270.730085] #1: (&qdisc_tx_lock){+.....}, at: __dev_queue_xmit+0x240/0x960
    [ 270.730090] stack backtrace:
    [ 270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
    [ 270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
    [ 270.730100] 0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
    [ 270.730103] ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
    [ 270.730105] ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
    [ 270.730108] Call Trace:
    [ 270.730121] dump_stack+0x4c/0x65
    [ 270.730148] lockdep_rcu_suspicious+0xe2/0x120
    [ 270.730153] task_cls_state+0x92/0xa0
    [ 270.730158] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
    [ 270.730164] tc_classify_compat+0x74/0xc0
    [ 270.730166] tc_classify+0x33/0x90
    [ 270.730170] htb_enqueue+0xaa/0x4a0 [sch_htb]
    [ 270.730172] __dev_queue_xmit+0x306/0x960
    [ 270.730174] ? __dev_queue_xmit+0x50/0x960
    [ 270.730176] dev_queue_xmit_sk+0x13/0x20
    [ 270.730185] dev_queue_xmit+0x10/0x20
    [ 270.730187] packet_snd.isra.62+0x54c/0x760
    [ 270.730190] packet_sendmsg+0x2f5/0x3f0
    [ 270.730203] ? sock_def_readable+0x5/0x190
    [ 270.730210] ? _raw_spin_unlock+0x2b/0x40
    [ 270.730216] ? unix_dgram_sendmsg+0x5cc/0x640
    [ 270.730219] sock_sendmsg+0x47/0x50
    [ 270.730221] sock_write_iter+0x7f/0xd0
    [ 270.730232] __vfs_write+0xa7/0xf0
    [ 270.730234] vfs_write+0xb8/0x190
    [ 270.730236] SyS_write+0x52/0xb0
    [ 270.730239] entry_SYSCALL_64_fastpath+0x12/0x76

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     
  • Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Otherwise the skbuff related structures are not correctly
    refcount'ed.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     
  • fib_select_default considers alternative routes only when
    res->fi is for the first alias in res->fa_head. In the
    common case this can happen only when the initial lookup
    matches the first alias with highest TOS value. This
    prevents the alternative routes from requiring a specific TOS.

    This patch solves the problem as follows:

    - routes that require specific TOS should be returned by
    fib_select_default only when TOS matches, as already done
    in fib_table_lookup. This rule implies that depending on the
    TOS we can have many different lists of alternative gateways
    and we have to keep the last used gateway (fa_default) in first
    alias for the TOS instead of using single tb_default value.

    - as the aliases are ordered by many keys (TOS desc,
    fib_priority asc), we restrict the possible results to
    routes with matching TOS and lowest metric (fib_priority)
    and routes that match any TOS, again with lowest metric.

    For example, a packet with TOS 8 cannot use gw3 (not lowest
    metric), gw4 (different TOS), or gw6 (not lowest metric);
    all other gateways can be used:

    tos 8 via gw1 metric 2 (fa_head and res->fi)
    tos 8 via gw2 metric 2
    tos 8 via gw3 metric 3
    tos 4 via gw4
    tos 0 via gw5
    tos 0 via gw6 metric 1
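
    The selection rule in the example above can be sketched in userspace C.
    This is only an illustration, not the kernel's fib_select_default: the
    struct alias_sketch type, the select_candidates helper, and the
    example_aliases table are hypothetical names, and gw4 and gw5 are assumed
    to have metric 0 because the listing gives them no metric.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a fib alias; not the kernel struct. */
struct alias_sketch {
    const char *gw;
    unsigned char tos;    /* 0 means "matches any TOS" */
    unsigned int metric;  /* plays the role of fib_priority; lower wins */
};

/* The six routes from the example, ordered by TOS desc, then metric asc.
 * Metric 0 for gw4 and gw5 is an assumption (no metric is listed). */
static const struct alias_sketch example_aliases[] = {
    { "gw1", 8, 2 },
    { "gw2", 8, 2 },
    { "gw3", 8, 3 },
    { "gw4", 4, 0 },
    { "gw5", 0, 0 },
    { "gw6", 0, 1 },
};

/*
 * Set usable[i] to 1 when aliases[i] may serve as an alternative gateway
 * for a packet with TOS pkt_tos, and return how many entries qualify.
 * Relies on the ordering above: the first alias seen for a TOS value
 * carries the lowest metric of that TOS group.
 */
static size_t select_candidates(const struct alias_sketch *aliases, size_t n,
                                unsigned char pkt_tos, int *usable)
{
    size_t count = 0;
    int have_group = 0;
    unsigned char group_tos = 0;
    unsigned int group_min = 0;

    assert(aliases != NULL && usable != NULL);

    for (size_t i = 0; i < n; i++) {
        const struct alias_sketch *a = &aliases[i];

        /* A change in TOS starts a new group; remember its lowest metric. */
        if (!have_group || a->tos != group_tos) {
            have_group = 1;
            group_tos = a->tos;
            group_min = a->metric;
        }

        /* Usable: TOS matches (or route accepts any TOS) and metric is lowest. */
        usable[i] = (a->tos == pkt_tos || a->tos == 0) &&
                    a->metric == group_min;
        count += (size_t)usable[i];
    }
    return count;
}
```

    Calling select_candidates(example_aliases, 6, 8, usable) marks gw1, gw2
    and gw5 as usable, which agrees with the list of excluded gateways in the
    example.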

    Reported-by: Hagen Paul Pfeifer
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • fib_trie starting from 4.1 can link fib aliases from
    different prefixes in the same list. Make sure the alternative
    gateways are in the same table and for the same prefix (0) by
    checking tb_id and fa_slen.

    Fixes: 79e5ad2ceb00 ("fib_trie: Remove leaf_info")
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

24 Jul, 2015

1 commit

  • Pull virtio/vhost fixes from Michael Tsirkin:
    "Bugfixes and documentation fixes.

    Igor's patch that allows users to tweak memory table size is
    borderline, but it does fix known crashes, so I merged it"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    vhost: add max_mem_regions module parameter
    vhost: extend memory regions allocation to vmalloc
    9p/trans_virtio: reset virtio device on remove
    virtio/s390: rename drivers/s390/kvm -> drivers/s390/virtio
    MAINTAINERS: separate section for s390 virtio drivers
    virtio: define virtio_pci_cfg_cap in header.
    virtio: Fix typecast of pointer in vring_init()
    virtio scsi: fix unused variable warning
    vhost: use binary search instead of linear in find_region()
    virtio_net: document VIRTIO_NET_CTRL_GUEST_OFFLOADS

    Linus Torvalds
     

23 Jul, 2015

4 commits

  • The l2cap_conn->smp pointer may be NULL for various valid reasons where SMP has
    failed to initialize properly. One such scenario is when crypto support is
    missing, another when the adapter has been powered on through a legacy method.
    The smp_conn_security() function should have the appropriate check for this
    situation to avoid NULL pointer dereferences.

    Signed-off-by: Johan Hedberg
    Signed-off-by: Marcel Holtmann
    Cc: stable@vger.kernel.org # 4.0+

    Johan Hedberg
     
  • Pull networking fixes from David Miller:

    1) Don't use shared bluetooth antenna in iwlwifi driver for management
    frames, from Emmanuel Grumbach.

    2) Fix device ID check in ath9k driver, from Felix Fietkau.

    3) Off by one in xen-netback BUG checks, from Dan Carpenter.

    4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann.

    5) Fix races in setting peeked bit flag in SKBs during datagram
    receive. If it's shared we have to clone it otherwise the value can
    easily be corrupted. Fix from Herbert Xu.

    6) Revert fec clock handling change, causes regressions. From Fabio
    Estevam.

    7) Fix use after free in fq_codel and sfq packet schedulers, from WANG
    Cong.

    8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.)
    from WANG Cong and Konstantin Khlebnikov.

    9) Memory leak in act_bpf packet action, from Alexei Starovoitov.

    10) ARM bpf JIT bug fixes from Nicolas Schichan.

    11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from
    Michael S Tsirkin.

    12) Destruction of bond with different ARP header types not handled
    correctly, fix from Nikolay Aleksandrov.

    13) Revert GRO receive support in ipv6 SIT tunnel driver, causes
    regressions because the GRO packets created cannot be processed
    properly on the GSO side if we forward the frame. From Herbert Xu.

    14) TCCR update race and other fixes to ravb driver from Sergei
    Shtylyov.

    15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet.

    16) Fix panics on packet scheduler filter replace, from Daniel Borkmann.

    17) Make sure AF_PACKET sees properly IP headers in defragmented frames
    (via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee.

    18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian
    Westphal.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
    ravb: fix ring memory allocation
    net: phy: dp83867: Fix warning check for setting the internal delay
    openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
    netlink: don't hold mutex in rcu callback when releasing mmapd ring
    ARM: net: fix vlan access instructions in ARM JIT.
    ARM: net: handle negative offsets in BPF JIT.
    ARM: net: fix condition for load_order > 0 when translating load instructions.
    tcp: suppress a division by zero warning
    drivers: net: cpsw: remove tx event processing in rx napi poll
    inet: frags: fix defragmented packet's IP header for af_packet
    net: mvneta: fix refilling for Rx DMA buffers
    stmmac: fix setting of driver data in stmmac_dvr_probe
    sched: cls_flow: fix panic on filter replace
    sched: cls_flower: fix panic on filter replace
    sched: cls_bpf: fix panic on filter replace
    net/mdio: fix mdio_bus_match for c45 PHY
    net: ratelimit warnings about dst entry refcount underflow or overflow
    caif: fix leaks and race in caif_queue_rcv_skb()
    qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
    ravb: fix race updating TCCR
    ...

    Linus Torvalds
     
  • Calling xprt_complete_bc_request() effectively causes the slot to be allocated,
    so it needs to decrement the backchannel free slot count as well.

    Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race")
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • xprt_alloc_bc_request() cannot call xprt_free_bc_request() without
    deadlocking, since it already holds the xprt->bc_pa_lock.

    Reported-by: Chuck Lever
    Fixes: 0d2a970d0ae55 ("SUNRPC: Fix a backchannel race")
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

22 Jul, 2015

3 commits

  • When zones were originally introduced, the expectation functions were
    all extended to perform lookup using the zone. However, insertion was
    not modified to check the zone. This means that two expectations which
    are intended to apply for different connections that have the same tuple
    but exist in different zones cannot both be tracked.
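
    A minimal sketch of the idea, assuming a much simpler model than the
    real conntrack code: the clash check on insertion compares the zone as
    well as the tuple. All names here (tuple_sketch, expect_sketch,
    expect_table, expect_insert) are hypothetical, and exact tuple equality
    stands in for the kernel's real matching logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified tuple and expectation; not the kernel structs. */
struct tuple_sketch {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t proto;
};

struct expect_sketch {
    struct tuple_sketch tuple;
    uint16_t zone;
};

#define EXPECT_TABLE_MAX 8

struct expect_table {
    struct expect_sketch slots[EXPECT_TABLE_MAX];
    size_t used;
};

static bool tuple_equal(const struct tuple_sketch *a,
                        const struct tuple_sketch *b)
{
    return a->src == b->src && a->dst == b->dst &&
           a->sport == b->sport && a->dport == b->dport &&
           a->proto == b->proto;
}

/* Two expectations clash only when both the zone and the tuple match. */
static bool expect_clash(const struct expect_sketch *a,
                         const struct expect_sketch *b)
{
    return a->zone == b->zone && tuple_equal(&a->tuple, &b->tuple);
}

/* Insert e unless it clashes with an existing entry or the table is full. */
static bool expect_insert(struct expect_table *t, const struct expect_sketch *e)
{
    assert(t != NULL && e != NULL);

    for (size_t i = 0; i < t->used; i++) {
        if (expect_clash(&t->slots[i], e))
            return false;
    }
    if (t->used == EXPECT_TABLE_MAX)
        return false;

    t->slots[t->used++] = *e;
    return true;
}
```

    With the zone comparison in expect_clash, two expectations with the same
    tuple in zones 1 and 2 can both be inserted; without it, the second
    insertion would be rejected as a duplicate.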

    Fixes: 5d0aa2ccd4 (netfilter: nf_conntrack: add support for "conntrack zones")
    Signed-off-by: Joe Stringer
    Signed-off-by: Pablo Neira Ayuso

    Joe Stringer
     
  • Some architectures like POWER can have a NUMA node_possible_map that
    contains sparse entries. This causes memory corruption with openvswitch
    since it allocates flow_cache with a multiple of num_possible_nodes() and
    assumes the node variable returned by for_each_node will index into
    flow->stats[node].

    Use nr_node_ids to allocate a maximal sparse array instead of
    num_possible_nodes().
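
    The difference between the two sizes can be illustrated in userspace C.
    This is a sketch only: the possible-node map is modelled as a 64-bit
    mask, and count_possible, node_id_limit and alloc_node_stats are
    hypothetical helpers that mimic what num_possible_nodes() and
    nr_node_ids represent. The node numbers used in the usage note are an
    assumed example of a sparse map.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of possible nodes: how many bits are set in the map. */
static unsigned int count_possible(uint64_t possible_map)
{
    unsigned int count = 0;

    while (possible_map) {
        count += (unsigned int)(possible_map & 1);
        possible_map >>= 1;
    }
    return count;
}

/* Highest possible node ID plus one: the size needed to index by node ID. */
static unsigned int node_id_limit(uint64_t possible_map)
{
    unsigned int limit = 0;

    while (possible_map) {
        limit++;
        possible_map >>= 1;
    }
    return limit;
}

/*
 * Allocate one zeroed counter per node ID, so stats[node] is in bounds
 * for every node in the map even when the map is sparse. The caller
 * frees the result with free().
 */
static unsigned long *alloc_node_stats(uint64_t possible_map,
                                       unsigned int *slots)
{
    assert(slots != NULL);

    *slots = node_id_limit(possible_map);
    return calloc(*slots ? *slots : 1, sizeof(unsigned long));
}
```

    For a map with nodes 0, 1, 16 and 17 set, count_possible returns 4 while
    node_id_limit returns 18, so an array sized by the count would be
    overrun as soon as it is indexed with node 16 or 17.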

    The crash was noticed after 3af229f2 was applied as it changed the
    node_possible_map to match node_online_map on boot.
    Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861

    Signed-off-by: Chris J Arges
    Acked-by: Pravin B Shelar
    Acked-by: Nishanth Aravamudan
    Signed-off-by: David S. Miller

    Chris J Arges
     
  • Kirill A. Shutemov says:

    This simple test case triggers a few locking asserts in the kernel:

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>  /* struct nl_mmap_req, NETLINK_RX_RING, NETLINK_TX_RING */

    int main(int argc, char **argv)
    {
        unsigned int block_size = 16 * 4096;
        struct nl_mmap_req req = {
            .nm_block_size = block_size,
            .nm_block_nr = 64,
            .nm_frame_size = 16384,
            .nm_frame_nr = 64 * block_size / 16384,
        };
        unsigned int ring_size;
        int fd;

        /* Configure both mmap rings on a generic netlink socket. */
        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
            exit(1);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
            exit(1);

        /* Map the RX and TX rings, then exit without unmapping or closing. */
        ring_size = req.nm_block_nr * req.nm_block_size;
        mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        return 0;
    }

    +++ exited with 0 +++
    BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
    in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
    3 locks held by init/1:
    #0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
    #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
    #2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
    Preemption disabled at:[] __delay+0xf/0x20

    CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
    ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
    0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
    ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
    Call Trace:
    [] dump_stack+0x4f/0x7b
    [] ___might_sleep+0x16d/0x270
    [] __might_sleep+0x4d/0x90
    [] mutex_lock_nested+0x2f/0x430
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] netlink_set_ring+0x1ed/0x350
    [] ? netlink_undo_bind+0x70/0x70
    [] netlink_sock_destruct+0x80/0x150
    [] __sk_free+0x1d/0x160
    [] sk_free+0x19/0x20
    [..]

    Cong Wang says:

    We can't hold mutex lock in a rcu callback, [..]

    Thomas Graf says:

    The socket should be dead at this point. It might be simpler to
    add a netlink_release_ring() function which doesn't require
    locking at all.

    Reported-by: "Kirill A. Shutemov"
    Diagnosed-by: Cong Wang
    Suggested-by: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal