26 Sep, 2015

3 commits

  • Pull networking fixes from David Miller:

    1) When we run a tap on netlink sockets, we have to copy mmap'd SKBs
    instead of cloning them. From Daniel Borkmann.

    2) When converting classical BPF into eBPF, fix the setting of the
    source reg to BPF_REG_X. From Tycho Andersen.

    3) Fix igmpv3/mldv2 report parsing in the bridge multicast code, from
    Linus Lüssing.

    4) Fix dst refcounting for ipv6 tunnels, from Martin KaFai Lau.

    5) Set NLM_F_REPLACE flag properly when replacing ipv6 routes, from
    Roopa Prabhu.

    6) Add some new cxgb4 PCI device IDs, from Hariprasad Shenai.

    7) Fix headroom tests and SKB leaks in ipv6 fragmentation code, from
    Florian Westphal.

    8) Check DMA mapping errors in bna driver, from Ivan Vecera.

    9) Several 8139cp bug fixes (dev_kfree_skb_any in interrupt context,
    misclearing of interrupt status in TX timeout handler, etc.) from
    David Woodhouse.

    10) In tipc, reset SKB header pointer after skb_linearize(), from Erik
    Hugne.

    11) Fix autobind races et al. in netlink code, from Herbert Xu with
    help from Tejun Heo and others.

    12) Missing SET_NETDEV_DEV in sunvnet driver, from Sowmini Varadhan.

    13) Fix various races in timewait timer and reqsk_queue_hash_req, from
    Eric Dumazet.

    14) Fix array overruns in mac80211, from Johannes Berg and Dan
    Carpenter.

    15) Fix data race in rhashtable_rehash_one(), from Dmitriy Vyukov.

    16) Fix race between poll_one_napi and napi_disable, from Neil Horman.

    17) Fix byte order in geneve tunnel port config, from John W Linville.

    18) Fix handling of ARP replies over lightweight tunnels, from Jiri
    Benc.

    19) We can loop when fib rule dumps cross multiple SKBs, fix from Wilson
    Kok and Roopa Prabhu.

    20) Several reference count handling bug fixes in the PHY/MDIO layer
    from Russell King.

    21) Fix lockdep splat in ppp_dev_uninit(), from Guillaume Nault.

    22) Fix crash in icmp_route_lookup(), from David Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net: Fix panic in icmp_route_lookup
    net: update docbook comment for __mdiobus_register()
    ppp: fix lockdep splat in ppp_dev_uninit()
    net: via/Kconfig: GENERIC_PCI_IOMAP required if PCI not selected
    phy: marvell: add link partner advertised modes
    net: fix net_device refcounting
    phy: add phy_device_remove()
    phy: fixed-phy: properly validate phy in fixed_phy_update_state()
    net: fix phy refcounting in a bunch of drivers
    of_mdio: fix MDIO phy device refcounting
    phy: add proper phy struct device refcounting
    phy: fix mdiobus module safety
    net: dsa: fix of_mdio_find_bus() device refcount leak
    phy: fix of_mdio_find_bus() device refcount leak
    ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
    ip6_gre: Reduce log level in ip6gre_err() to debug
    fib_rules: fix fib rule dumps across multiple skbs
    bnx2x: byte swap rss_key to comply to Toeplitz specs
    net: revert "net_sched: move tp->root allocation into fw_init()"
    lwtunnel: remove source and destination UDP port config option
    ...

    Linus Torvalds
     
  • Andrey reported a panic:

    [ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
    [ 7249.865559] IP: [] icmp_route_lookup+0xaa/0x320
    [ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
    [ 7249.865637] Oops: 0000 [#1]
    ...
    [ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.3.0-999-generic #201509220155
    [ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014 08/02/2006
    [ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
    [ 7249.866949] EIP: 0060:[] EFLAGS: 00210246 CPU: 0
    [ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
    [ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
    [ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
    [ 7249.867077] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    [ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
    [ 7249.867141] Stack:
    [ 7249.867165] 320310ee 00000000 00000042 320310ee 00000000 c1aeca00 f3920240 f0c69180
    [ 7249.867268] f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000 e659266c f483ba54
    [ 7249.867361] 8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0 c16b0e00 00000064
    [ 7249.867448] Call Trace:
    [ 7249.867494] [] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
    [ 7249.867534] [] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
    [ 7249.867576] [] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
    [ 7249.867615] [] ? icmp_send+0xa0/0x380
    [ 7249.867648] [] icmp_send+0x2cf/0x380
    [ 7249.867681] [] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
    [ 7249.867714] [] reject_tg+0x7a/0x9f [ipt_REJECT]
    [ 7249.867746] [] ipt_do_table+0x317/0x70c [ip_tables]
    [ 7249.867780] [] ? __nf_conntrack_find_get+0x166/0x3b0 [nf_conntrack]
    [ 7249.867838] [] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
    [ 7249.867889] [] iptable_filter_hook+0x35/0x80 [iptable_filter]
    [ 7249.867933] [] nf_iterate+0x71/0x80
    [ 7249.867970] [] nf_hook_slow+0x65/0xc0
    [ 7249.868002] [] __ip_local_out_sk+0xc1/0xd0
    [ 7249.868034] [] ? ip_forward_options+0x1a0/0x1a0
    [ 7249.868066] [] ip_local_out_sk+0x16/0x30
    [ 7249.868097] [] ip_send_skb+0x14/0x80
    [ 7249.868129] [] ip_push_pending_frames+0x34/0x40
    [ 7249.868163] [] ip_send_unicast_reply+0x282/0x310
    [ 7249.868196] [] tcp_v4_send_reset+0x1b3/0x380
    [ 7249.868227] [] tcp_v4_rcv+0x323/0x990
    [ 7249.868257] [] ? nf_iterate+0x71/0x80
    [ 7249.868289] [] ip_local_deliver_finish+0x8b/0x230
    [ 7249.868322] [] ip_local_deliver+0x4c/0xa0
    [ 7249.868353] [] ? ip_rcv_finish+0x390/0x390
    [ 7249.868384] [] ip_rcv_finish+0x7c/0x390
    [ 7249.868415] [] ip_rcv+0x2e0/0x420
    ...

    Prior to the VRF change the oif was not set in the flow struct, so the
    VRF support should really have only added the vrf_master_ifindex lookup.

    Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
    Cc: Andrey Melnikov
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - fix v4.2 SEEK on files over 2 gigs
    - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
    - Fix recovery of recalled read delegations

    Bugfixes:
    - Fix a case where NFSv4 fails to send CLOSE after a server reboot
    - Fix sunrpc to wait for connections to complete before retrying
    - Fix sunrpc races between transport connect/disconnect and shutdown
    - Fix an infinite loop when layoutget fails with BAD_STATEID
    - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
    - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
    - Fix layoutreturn/close ordering issues"

    * tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS41: make close wait for layoutreturn
    NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
    NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
    NFSv4: Recovery of recalled read delegations is broken
    NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
    NFS: Do cleanup before resetting pageio read/write to mds
    SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
    SUNRPC: Lock the transport layer on shutdown
    nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
    SUNRPC: Ensure that we wait for connections to complete before retrying
    SUNRPC: drop null test before destroy functions
    nfs: fix v4.2 SEEK on files over 2 gigs
    SUNRPC: Fix races between socket connection and destroy code
    nfs: fix pg_test page count calculation
    Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount

    Linus Torvalds
     

25 Sep, 2015

10 commits

  • of_find_net_device_by_node() uses class_find_device() internally to
    look up the corresponding network device. class_find_device() returns
    a reference to the embedded struct device, with its refcount
    incremented.

    Add a comment to the definition in net/core/net-sysfs.c indicating the
    need to drop this refcount, and fix the DSA code to drop this refcount
    when the OF-generated platform data is cleaned up and freed. Also
    arrange for the ref to be dropped when handling errors.

    Signed-off-by: Russell King
    Signed-off-by: David S. Miller

    Russell King
     
  • Current users of of_mdio_find_bus() leak a struct device refcount, as
    they fail to clean up the reference obtained inside class_find_device().

    Fix the DSA code to properly refcount the returned MDIO bus by:
    1. taking a reference on the struct device whenever we assign it to
    pd->chip[x].host_dev.
    2. dropping the reference when we overwrite the existing reference.
    3. dropping the reference when we free the data structure.
    4. dropping the initial reference we obtained after setting up the
    platform data structure, or on failure.

    In step 2 above, where we obtain a new MDIO bus, there is no need to
    take a reference on it as we would only have to drop it immediately
    after assignment again, iow:

    put_device(cd->host_dev); /* drop original assignment ref */
    cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
    put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */
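
    The net effect of the three calls above can be checked with a toy
    reference counter. This is only a sketch: struct toy_device, toy_get()
    and toy_put() are hypothetical stand-ins for struct device, get_device()
    and put_device(), and nothing is freed when the count reaches zero.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct device: only the reference count. */
struct toy_device {
    int refcount;
};

/* Models get_device(): take a reference and return the same pointer. */
static struct toy_device *toy_get(struct toy_device *dev)
{
    if (dev)
        dev->refcount++;
    return dev;
}

/* Models put_device(): drop one reference. */
static void toy_put(struct toy_device *dev)
{
    if (dev)
        dev->refcount--;
}

/* Long form: drop the old ref, take our own ref, drop the lookup ref. */
static void toy_replace_long(struct toy_device **slot, struct toy_device *found)
{
    toy_put(*slot);
    *slot = toy_get(found);
    toy_put(found);
}

/* Short form: the reference obtained by the lookup is handed to the slot. */
static void toy_replace_short(struct toy_device **slot, struct toy_device *found)
{
    toy_put(*slot);
    *slot = found;
}
```

    Both forms leave the new device with exactly one reference owned by the
    slot, which is why the extra get/put pair can be skipped.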

    Signed-off-by: Russell King
    Signed-off-by: David S. Miller

    Russell King
     
  • Currently error log messages in ip6_tnl_err are printed at 'warn'
    level. This is different to other tunnel types which don't print
    any messages. These log messages don't provide any information that
    couldn't be deduced with networking tools. Also it can be annoying
    to have one end of the tunnel go down and have the logs fill with
    pointless messages such as "Path to destination invalid or inactive!".

    This patch reduces the log level of these messages to 'dbg' level to
    bring the visible behaviour into line with other tunnel types.

    Signed-off-by: Matt Bennett
    Signed-off-by: David S. Miller

    Matt Bennett
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    Just two small fixes:
    * VHT MCS mask array overrun, reported by Dan Carpenter
    * reset CQM history to always get a notification, from Sara Sharon
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Currently error log messages in ip6gre_err are printed at 'warn'
    level. This is different to most other tunnel types which don't
    print any messages. These log messages don't provide any information
    that couldn't be deduced with networking tools. Also it can be annoying
    to have one end of the tunnel go down and have the logs fill with
    pointless messages such as "Path to destination invalid or inactive!".

    This patch reduces the log level of these messages to 'dbg' level to
    bring the visible behaviour into line with other tunnel types.

    Signed-off-by: Matt Bennett
    Signed-off-by: David S. Miller

    Matt Bennett
     
  • dump_rules returns the skb length rather than an error code.
    But when family == AF_UNSPEC, the caller of dump_rules
    assumes that it returns an error. Hence, when family == AF_UNSPEC,
    we continue trying to dump after -EMSGSIZE errors, which leaves an
    incorrect dump idx carried between skbs belonging to the same dump.
    As a result, a fib rule dump only ever returns the rules that fit
    into the first skb.

    This patch fixes dump_rules to return error so that we exit correctly
    and idx is correctly maintained between skbs that are part of the
    same dump.
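
    The contract described above can be sketched in userspace. Everything
    here is hypothetical (toy_dump, toy_dump_rules(), toy_dump_all(), the
    byte sizes); it only models "return an error when the buffer is full
    and keep idx at the first rule that did not fit", not the real
    fib_rules code.

```c
#include <errno.h>

/* Hypothetical dump state: index of the next rule, carried between buffers. */
struct toy_dump {
    int idx;
};

/*
 * Fill one buffer of 'room' bytes with rules of 'rule_len' bytes each,
 * starting at cb->idx. Returns 0 once every rule has been dumped, or
 * -EMSGSIZE when the buffer is full; cb->idx then points at the first
 * rule that did not fit.
 */
static int toy_dump_rules(struct toy_dump *cb, int nrules, int rule_len, int room)
{
    int used = 0;

    while (cb->idx < nrules) {
        if (used + rule_len > room)
            return -EMSGSIZE;
        used += rule_len;
        cb->idx++;
    }
    return 0;
}

/*
 * Drive a complete dump. The caller stops filling the current buffer as
 * soon as it sees an error and resumes from cb.idx with a fresh buffer.
 * Returns the number of buffers used, or -1 if the dump never completes
 * (for example when a single rule is larger than a buffer).
 */
static int toy_dump_all(int nrules, int rule_len, int room)
{
    struct toy_dump cb = { 0 };
    int buffers = 0;
    int err;

    do {
        err = toy_dump_rules(&cb, nrules, rule_len, room);
        buffers++;
    } while (err == -EMSGSIZE && buffers < 1000);

    return cb.idx == nrules ? buffers : -1;
}
```

    With 10 rules of 100 bytes and 350-byte buffers, three rules fit per
    buffer, so the dump resumes at idx 3, 6 and 9 and needs four buffers.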

    Signed-off-by: Wilson Kok
    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Wilson Kok
     
  • fw filter uses tp->root==NULL to check if it is the old method,
    so it doesn't need allocation at all in this case. This patch
    reverts the offending commit and adds some comments for old
    method to make it obvious.

    Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
    Reported-by: Akshat Kakkar
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • The UDP tunnel config is asymmetric with respect to the ports used. The source and
    destination ports from one direction of the tunnel are not related to the
    ports of the other direction. We need to be able to respond to ARP requests
    using the correct ports without involving routing.

    As a consequence, UDP ports need to be a fixed property of the tunnel
    interface and cannot be set per route. Remove the ability to set ports per
    route. This is still okay to do, as no kernel has been released with these
    attributes yet.

    Note that the ability to specify source and destination ports is preserved
    for other users of the lwtunnel API which don't use routes for tunnel key
    specification (like openvswitch).

    If in the future we rework ARP handling to allow port specification, the
    attributes can be added back.

    Signed-off-by: Jiri Benc
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • When using ip lwtunnels, the additional data for xmit (basically, the actual
    tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
    metadata dst. When replying to ARP requests, we need to send the reply to
    the same tunnel the request came from. This means we need to construct
    proper metadata dst for ARP replies.

    We could perform another route lookup to get a dst entry with the correct
    lwtstate. However, this won't always ensure that the outgoing tunnel is the
    same as the incoming one, and it won't work anyway for IPv4 duplicate
    address detection.

    The only thing to do is to "reverse" the ip_tunnel_info.
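
    A minimal sketch of what "reversing" means, assuming a simplified key
    with only a tunnel id and two addresses. struct toy_tunnel_key and
    toy_reverse_key() are hypothetical; the real ip_tunnel_info carries
    more fields than shown here.

```c
#include <stdint.h>

/* Hypothetical, reduced tunnel key: id plus outer source/destination. */
struct toy_tunnel_key {
    uint64_t tun_id;
    uint32_t src;
    uint32_t dst;
};

/*
 * Build the transmit-side key for a reply from the key seen on receive:
 * the tunnel id is kept, source and destination trade places.
 */
static struct toy_tunnel_key toy_reverse_key(struct toy_tunnel_key rx)
{
    struct toy_tunnel_key tx = rx;

    tx.src = rx.dst;
    tx.dst = rx.src;
    return tx;
}
```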

    Signed-off-by: Jiri Benc
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
    >
    > store_release and load_acquire are different from the usual memory
    > barriers and can't be paired this way. You have to pair store_release
    > and load_acquire. Besides, it isn't a particularly good idea to

    OK I've decided to drop the acquire/release helpers as they don't
    help us at all and simply pessimise the code by using full memory
    barriers (on some architectures) where only a write or read barrier
    is needed.

    > depend on memory barriers embedded in other data structures like the
    > above. Here, especially, rhashtable_insert() would have write barrier
    > *before* the entry is hashed not necessarily *after*, which means that
    > in the above case, a socket which appears to have set bound to a
    > reader might not be visible when the reader tries to look up the socket
    > on the hashtable.

    But you are right we do need an explicit write barrier here to
    ensure that the hashing is visible.

    > There's no reason to be overly smart here. This isn't a crazy hot
    > path, write barriers tend to be very cheap, store_release more so.
    > Please just do smp_store_release() and note what it's paired with.

    It's not about being overly smart. It's about actually understanding
    what's going on with the code. I've seen too many instances of
    people simply sprinkling synchronisation primitives around without
    any knowledge of what is happening underneath, which is just a recipe
    for creating hard-to-debug races.

    > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
    > > }
    > > }
    > >
    > > - if (!nlk->portid) {
    > > + if (!nlk->bound) {
    >
    > I don't think you can skip load_acquire here just because this is the
    > second deref of the variable. That doesn't change anything. Race
    > condition could still happen between the first and second tests and
    > skipping the second would lead to the same kind of bug.

    The reason this one is OK is because we do not use nlk->portid or
    try to get nlk from the hash table before we return to user-space.

    However, there is a real bug here that none of these acquire/release
    helpers discovered. The two bound tests here used to be a single
    one. Now that they are separate it is entirely possible for another
    thread to come in the middle and bind the socket. So we need to
    repeat the portid check in order to maintain consistency.

    > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
    > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
    > > return -EPERM;
    > >
    > > - if (!nlk->portid)
    > > + if (!nlk->bound)
    >
    > Don't we need load_acquire here too? Is this path holding a lock
    > which makes that unnecessary?

    Ditto.

    ---8<---
    [...] nlk->bound once in netlink_bind fixes
    a race where two threads that bind the socket at the same time
    with different port IDs may both succeed.

    Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Nacked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Herbert Xu
     

24 Sep, 2015

3 commits

  • Commit 7d82410950aa ("virtio: add explicit big-endian support to memory
    accessors") accidentally changed the virtio_net header used by
    AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

    Since virtio_legacy_is_little_endian() is a very long identifier,
    define a vio_le macro and use that throughout the code instead of the
    hard-coded 'false' for little-endian.

    This restores the ABI to match 4.1 and earlier kernels, and makes my
    test program work again.
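
    The effect of the hard-coded flag can be sketched in plain C. The
    helper names below are hypothetical (toy_ prefix) and only model a
    16-bit conversion that takes a "little-endian" flag; passing false on a
    little-endian host selects big-endian encoding, while passing the
    host's own endianness leaves the value untouched.

```c
#include <stdbool.h>
#include <stdint.h>

/* Detect the byte order of the machine running this code. */
static bool toy_host_is_little_endian(void)
{
    const uint16_t probe = 1;

    return *(const uint8_t *)&probe == 1;
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t toy_swab16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Encode a CPU-order value as a little-endian (flag true) or big-endian
 * (flag false) 16-bit field. No swap is needed when the requested order
 * matches the host order.
 */
static uint16_t toy_cpu_to_virtio16(bool little_endian, uint16_t val)
{
    return little_endian == toy_host_is_little_endian() ? val : toy_swab16(val);
}
```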

    Signed-off-by: David Woodhouse
    Signed-off-by: David S. Miller

    David Woodhouse
     
  • Drivers might call napi_disable while not holding the napi instance poll_lock.
    In those instances, it's possible for a race condition to exist between
    poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the
    NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
    the following may happen:

    CPU0                            CPU1
    ndo_tx_timeout                  napi_poll_dev
    napi_disable                    poll_one_napi
    test_and_set_bit (ret 0)
                                    test_bit (ret 1)
    reset adapter                   napi_poll_routine

    If the adapter gets a tx timeout without a napi instance scheduled, it's possible
    for the adapter to think it has exclusive access to the hardware (as the napi
    instance is now scheduled via the napi_disable call), while the netpoll code
    thinks there is simply work to do. The result is parallel hardware access
    leading to corrupt data structures in the driver, and a crash.

    Additionally, there is another, more critical race between netpoll and
    napi_disable. The disabled napi state is actually identical to the scheduled
    state for a given napi instance. The implication being that, if a napi instance
    is disabled, a netconsole instance would see the napi state of the device as
    having been scheduled, and poll it, likely while the driver was doing something
    requiring exclusive access. In the case above, it's fairly clear that not having
    the rings in a state ready to be polled will cause any number of crashes.

    The fix should be pretty easy. netpoll uses its own bit to indicate that
    the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
    We can just gate disabling on that bit as well as the sched bit. That should
    prevent netpoll from conducting a napi poll if we convert its set bit to a
    test_and_set_bit operation to provide mutual exclusion.
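
    The mutual exclusion idea can be sketched with C11 atomics. All names
    are hypothetical (toy_ prefix), and the disable path here fails instead
    of waiting, which the real napi_disable() does not do; the point is only
    that whoever sets the NPSVC-style bit first owns the instance.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
    TOY_STATE_SCHED = 1u << 0,  /* instance scheduled or disabled */
    TOY_STATE_NPSVC = 1u << 1   /* instance being serviced by the poller */
};

struct toy_napi {
    atomic_uint state;
};

/* Atomically set a bit and report whether it was already set. */
static bool toy_test_and_set(struct toy_napi *n, unsigned int bit)
{
    return (atomic_fetch_or(&n->state, bit) & bit) != 0;
}

/* Atomically clear a bit. */
static void toy_clear(struct toy_napi *n, unsigned int bit)
{
    atomic_fetch_and(&n->state, ~bit);
}

/* Poller side: claim the service bit, and only poll if the claim worked. */
static bool toy_netpoll_try_poll(struct toy_napi *n)
{
    if (toy_test_and_set(n, TOY_STATE_NPSVC))
        return false;           /* someone else owns the instance */
    /* ... the actual poll would run here ... */
    toy_clear(n, TOY_STATE_NPSVC);
    return true;
}

/* Disable side: needs both bits; backs out if the second one is taken. */
static bool toy_try_disable(struct toy_napi *n)
{
    if (toy_test_and_set(n, TOY_STATE_SCHED))
        return false;
    if (toy_test_and_set(n, TOY_STATE_NPSVC)) {
        toy_clear(n, TOY_STATE_SCHED);
        return false;
    }
    return true;
}
```

    Once toy_try_disable() has succeeded, toy_netpoll_try_poll() keeps
    returning false until both bits are cleared again.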

    Change notes:
    V2)
    Remove a trailing whitespace
    Resubmit with proper subject prefix

    V3)
    Clean up spacing nits

    Signed-off-by: Neil Horman
    CC: "David S. Miller"
    CC: jmaxwell@redhat.com
    Tested-by: jmaxwell@redhat.com
    Signed-off-by: David S. Miller

    Neil Horman
     
  • RST packets sent on behalf of TCP connections with TS option (RFC 7323
    TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

    A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100 ecr 0], length 0
    B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss 1460,nop,nop,TS val 7264344 ecr 100], length 0
    A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr 7264344], length 0

    B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0 ecr 110], length 0

    We need to call skb_mstamp_get() to get proper TS val,
    derived from skb->skb_mstamp

    Note that RFC 1323 was advocating to not send TS option in RST segment,
    but RFC 7323 recommends the opposite :

    Once TSopt has been successfully negotiated, that is both <SYN> and
    <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
    segment for the duration of the connection, and SHOULD be sent in an
    <RST> segment (see Section 5.2 for details)

    Note this RFC recommends to send TS val = 0, but we believe it is
    premature : We do not know if all TCP stacks are properly
    handling the receive side :

    When an <RST> segment is
    received, it MUST NOT be subjected to the PAWS check by verifying an
    acceptable value in SEG.TSval, and information from the Timestamps
    option MUST NOT be used to update connection state information.
    SEG.TSecr MAY be used to provide stricter acceptance checks.

    In 5 years, if/when all TCP stacks are RFC 7323 ready, we might consider
    sending TS val = 0, if it buys something.

    Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Sep, 2015

3 commits

  • The Marvell Egress rx trailer check must be fixed to
    correctly detect bad bits in the third byte of the
    Egress trailer, as described in Table 28 of the
    88E6060 datasheet.
    The current code never checks the third byte and
    checks the fourth byte twice.
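
    The shape of the mistake can be sketched as follows. The function
    names are hypothetical and the byte values and masks (0x80, 0xf8, 0xef)
    are illustrative assumptions, not values taken from the datasheet; what
    matters is which index each comparison reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Buggy shape: byte 3 is tested twice and byte 2 is never looked at. */
static bool toy_trailer_ok_buggy(const uint8_t t[4])
{
    return !(t[0] != 0x80 || (t[1] & 0xf8) != 0x00 ||
             (t[3] & 0xef) != 0x00 || t[3] != 0x00);
}

/* Fixed shape: each of the four trailer bytes is tested exactly once. */
static bool toy_trailer_ok_fixed(const uint8_t t[4])
{
    return !(t[0] != 0x80 || (t[1] & 0xf8) != 0x00 ||
             (t[2] & 0xef) != 0x00 || t[3] != 0x00);
}
```

    A trailer with a stray bit in its third byte passes the buggy check
    and is rejected by the fixed one.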

    Signed-off-by: Neil Armstrong
    Acked-by: Guenter Roeck
    Signed-off-by: David S. Miller

    Neil Armstrong
     
  • When support for megaflows was introduced, OVS needed to start
    installing flows with a mask applied to them. Since masking is an
    expensive operation, OVS also had an optimization that would only
    take the parts of the flow keys that were covered by a non-zero
    mask. The values stored in the remaining pieces should not matter
    because they are masked out.

    While this works fine for the purposes of matching (which must always
    look at the mask), serialization to netlink can be problematic. Since
    the flow and the mask are serialized separately, the uninitialized
    portions of the flow can be encoded with whatever values happen to be
    present.

    In terms of functionality, this has little effect since these fields
    will be masked out by definition. However, it leaks kernel memory to
    userspace, which is a potential security vulnerability. It is also
    possible that other code paths could look at the masked key and get
    uninitialized data, although this does not currently appear to be an
    issue in practice.

    This removes the mask optimization for flows that are being installed.
    This was always intended to be the case as the mask optimizations were
    really targeting per-packet flow operations.
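
    The leak can be illustrated with byte arrays. This is a sketch with
    hypothetical helpers, not the openvswitch code: masking only a
    sub-range leaves whatever was previously in the destination, while
    masking the full key overwrites every byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Range-limited masking: bytes outside [start, end) are left untouched. */
static void toy_mask_key_range(uint8_t *dst, const uint8_t *src,
                               const uint8_t *mask, size_t start, size_t end)
{
    for (size_t i = start; i < end; i++)
        dst[i] = src[i] & mask[i];
}

/* Full masking: every byte of dst is written, masked-out bytes become 0. */
static void toy_mask_key_full(uint8_t *dst, const uint8_t *src,
                              const uint8_t *mask, size_t len)
{
    toy_mask_key_range(dst, src, mask, 0, len);
}
```

    If the destination buffer held stale data, only the full variant
    guarantees that none of it survives into the serialized key.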

    Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
    Signed-off-by: Jesse Gross
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     
  • This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts
    fs/userfaultfd.c to use the old version of that function.

    It didn't look robust to call __wake_up_common with "nr == 1" when we
    absolutely require wakeall semantics, but we've full control of what we
    insert in the two waitqueue heads of the blocked userfaults. No
    exclusive waitqueue risks to be inserted into those two waitqueue heads
    so we can as well stick to "nr == 1" of the old code and we can rely
    purely on the fact no waitqueue inserted in one of the two waitqueue
    heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

    Signed-off-by: Andrea Arcangeli
    Cc: Dr. David Alan Gilbert
    Cc: Michael Ellerman
    Cc: Shuah Khan
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

22 Sep, 2015

4 commits

  • The current behavior of notifying CQM events is inconsistent:
    Upon first configuration there is a cqm event with the current
    status according to threshold configured, regardless of signal
    stability.
    When there is reconfiguration no event is sent unless there is
    a significant change to the signal level according to the new
    configuration.

    Since the current reconfiguration behavior might cause missing
    CQM events in case the current signal did not change but is on
    the other side of the new threshold, fix that by resetting the
    stored signal level upon reconfiguration.

    Signed-off-by: Sara Sharon
    Signed-off-by: Luca Coelho
    Signed-off-by: Johannes Berg

    Sara Sharon
     
  • The HT MCS mask has 9 bytes, the VHT one only has 8 streams.
    Split the loops to handle this correctly.
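
    A sketch of the split, using the sizes quoted above as illustrative
    constants. The struct and function names are hypothetical; the point is
    that each array is walked with its own bound instead of sharing one
    loop index.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    TOY_HT_MASK_LEN = 9,  /* bytes in the HT mask, as quoted above */
    TOY_VHT_NSS_MAX = 8   /* entries in the VHT mask, one per stream */
};

struct toy_rate_mask {
    uint8_t ht_mcs[TOY_HT_MASK_LEN];
    uint16_t vht_mcs[TOY_VHT_NSS_MAX];
};

/* True if any HT or VHT entry has at least one rate bit cleared. */
static bool toy_mask_restricts(const struct toy_rate_mask *m)
{
    /* First loop: bounded by the HT array length only. */
    for (int i = 0; i < TOY_HT_MASK_LEN; i++)
        if (m->ht_mcs[i] != 0xff)
            return true;

    /* Second loop: bounded by the VHT array length only. */
    for (int i = 0; i < TOY_VHT_NSS_MAX; i++)
        if (m->vht_mcs[i] != 0xffff)
            return true;

    return false;
}
```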

    Reported-by: Dan Carpenter
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Before allowing lockless LISTEN processing, we need to make
    sure to arm the SYN_RECV timer before the req socket is visible
    in hash tables.

    Also, req->rsk_hash should be written before we set rsk_refcnt
    to a non zero value.

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When creating a timewait socket, we need to arm the timer before
    allowing other cpus to find it. The signal allowing cpus to find
    the socket is setting tw_refcnt to non zero value.

    As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
    call inet_twsk_schedule() first.

    This also means we need to remove tw_refcnt changes from
    inet_twsk_schedule() and let the caller handle it.
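
    The ordering can be modelled with C11 atomics. This is a userspace
    sketch with hypothetical names: the "timer" is just a flag, and the
    initial count of 1 is an arbitrary choice for the example. The writer
    arms the timer and only then publishes a non-zero reference count; the
    reader refuses any object whose count is still zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_tw_sock {
    bool timer_armed;   /* plain field, written before publication */
    atomic_int refcnt;  /* zero means "not yet visible to lookups" */
};

/* Writer: arm first, then publish with release semantics. */
static void toy_tw_publish(struct toy_tw_sock *tw)
{
    tw->timer_armed = true;
    atomic_store_explicit(&tw->refcnt, 1, memory_order_release);
}

/*
 * Reader: take a reference only if the count is non-zero. The acquire
 * ordering pairs with the release store above, so a successful lookup
 * also observes timer_armed == true.
 */
static bool toy_tw_lookup(struct toy_tw_sock *tw)
{
    int old = atomic_load_explicit(&tw->refcnt, memory_order_acquire);

    while (old != 0) {
        if (atomic_compare_exchange_weak_explicit(&tw->refcnt, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}
```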

    Note that because we use mod_timer_pinned(), we have the guarantee
    the timer won't expire before we set tw_refcnt as we run in BH context.

    To make things more readable I introduced inet_twsk_reschedule() helper.

    When rearming the timer, we can use mod_timer_pending() to make sure
    we do not rearm a canceled timer.

    Note: This bug can possibly trigger if packets of a flow can hit
    multiple cpus. This does not normally happen, unless flow steering
    is broken somehow. This explains why this bug was spotted ~5 months
    after its introduction.

    A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
    but will be provided in a separate patch for proper tracking.

    Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
    Signed-off-by: Eric Dumazet
    Reported-by: Ying Cai
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2015

5 commits

  • The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID. This led to kernel
    deadlocks that were observed by multiple people.

    This patch reverts that commit and instead fixes it by introducing
    a separate rhash_portid variable so that the real portid is only set
    after the socket has been successfully hashed.

    Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This was already done a long time ago in
    commit 64194c31a0b6 ("inet: Make tunnel RX/TX byte counters more consistent")
    but tx path was broken (at least since 3.10).

    Before the patch the gre header was included on tx.

    After the patch:
    $ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
    PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
    64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms

    --- 192.168.0.121 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
    7: gre1@NONE: mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/gre 10.16.0.249 peer 10.16.0.121
    RX: bytes packets errors dropped overrun mcast
    84 1 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    84 1 0 0 0 0

    Reported-by: Julien Meunier
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patch contains Netfilter fixes for your net tree, they are:

    1) nf_log_unregister() should only set to NULL the logger that is being
    unregistered, instead of everything else. Patch from Florian Westphal.

    2) Fix a crash when accessing physoutdev from PREROUTING in br_netfilter.
    This is partially reverting the patch to shrink nf_bridge_info to 32 bytes.
    Also from Florian.

    3) Use existing match/target extensions in the internal nft_compat extension
    lists when the extension is family unspecific (ie. NFPROTO_UNSPEC).

    4) Wait for rcu grace period before leaving nf_log_unregister().
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The msg pointer into header may change after skb linearization.
    We must reinitialize it after calling skb_linearize to prevent
    operating on a freed or invalid pointer.
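
    The same hazard exists in userspace with realloc(), which makes for a
    small sketch. struct toy_buf, toy_linearize() and toy_check_header()
    are hypothetical; toy_linearize() stands in for any call that may move
    the underlying storage.

```c
#include <stdlib.h>
#include <string.h>

struct toy_buf {
    unsigned char *data;
    size_t len;
};

/* Grow the buffer; like a linearize step, this may move the storage. */
static int toy_linearize(struct toy_buf *b, size_t new_len)
{
    unsigned char *p;

    if (new_len < b->len)
        return -1;
    p = realloc(b->data, new_len);
    if (!p)
        return -1;
    memset(p + b->len, 0, new_len - b->len);
    b->data = p;
    b->len = new_len;
    return 0;
}

/*
 * Read the first header byte, grow the buffer, then read it again.
 * Assumes b->len >= 1. Returns 1 if both reads agree, 0 if they differ,
 * -1 if growing failed.
 */
static int toy_check_header(struct toy_buf *b, size_t new_len)
{
    const unsigned char *hdr = b->data;
    unsigned char tag = hdr[0];

    if (toy_linearize(b, new_len))
        return -1;

    hdr = b->data;  /* re-derive: the old pointer may now be dangling */
    return hdr[0] == tag;
}
```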

    Signed-off-by: Erik Hugne
    Reported-by: Tamás Végh
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • The ip-route(8) man page says the following about route types:

    unreachable - these destinations are unreachable. Packets are discarded
    and the ICMP message host unreachable is generated. The local
    senders get an EHOSTUNREACH error.

    blackhole - these destinations are unreachable. Packets are discarded
    silently. The local senders get an EINVAL error.

    prohibit - these destinations are unreachable. Packets are discarded
    and the ICMP message communication administratively prohibited is
    generated. The local senders get an EACCES error.

    In the inet6 address family, this was correct, except the local senders
    got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
    In the inet address family, all three route types generated ICMP message
    net unreachable, and the local senders got ENETUNREACH error.

    In both address families all three route types now behave consistently
    with documentation.
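
    The documented behaviour amounts to a small lookup from route type to
    the errno seen by local senders. The enum and function below are
    hypothetical names used only to spell that table out.

```c
#include <errno.h>

enum toy_route_type {
    TOY_RT_UNREACHABLE,
    TOY_RT_BLACKHOLE,
    TOY_RT_PROHIBIT
};

/* errno reported to local senders, following the man page text above. */
static int toy_route_errno(enum toy_route_type type)
{
    switch (type) {
    case TOY_RT_UNREACHABLE:
        return EHOSTUNREACH;
    case TOY_RT_BLACKHOLE:
        return EINVAL;
    case TOY_RT_PROHIBIT:
        return EACCES;
    }
    return 0;  /* not reached for valid enum values */
}
```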

    Signed-off-by: Nikola Forró
    Signed-off-by: David S. Miller

    Nikola Forró
     

20 Sep, 2015

2 commits


18 Sep, 2015

10 commits

  • Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
    is normally set at ACK processing time, not at send time.

    A proper fix would need an additional state variable, which does not
    seem worth the trouble, given that the CUBIC bug had been there
    forever before Jana noticed it.

    Let's simply not set epoch_start in the future, otherwise
    bictcp_update() could overflow and CUBIC would again
    grow cwnd too fast.

    This was detected thanks to a packetdrill test Neal wrote that was flaky
    before applying this fix.

    Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Cc: Jana Iyengar
    Signed-off-by: David S. Miller

    Eric Dumazet
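
    The clamping idea can be sketched as follows. This is an illustration and not the tcp_cubic code: it assumes that epoch_start is shifted forward by the measured idle time and is then capped at the current time, and time_after32() and shift_epoch() are hypothetical helpers operating on wrapping 32-bit timestamps.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe test: is timestamp a later than timestamp b? */
static int time_after32(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Shift the epoch forward by the idle time, but never past `now`. */
static uint32_t shift_epoch(uint32_t epoch_start, uint32_t idle, uint32_t now)
{
    if (!epoch_start)
        return 0;  /* epoch not started yet, nothing to shift */

    epoch_start += idle;
    if (time_after32(epoch_start, now))
        epoch_start = now;  /* an epoch in the future would break the curve */
    return epoch_start;
}
```

    With now = 2000, an epoch of 1000 and 50 ticks of idle time yields 1050, while an epoch of 1900 and 500 ticks of (overestimated) idle time is capped at 2000.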
     
  • Johan Hedberg says:

    ====================
    pull request: bluetooth 2015-09-17

    Here's one important patch for the 4.3-rc series that fixes an issue
    with Bluetooth LE encryption failing because of a too early check for
    the SMP context.

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • If we didn't call ATMARP_MKIP before ATMARP_ENCAP, the VCC descriptor is
    nonexistent and we'll end up dereferencing a NULL ptr:

    [1033173.491930] kasan: GPF could be caused by NULL-ptr deref or user memory accessirq event stamp: 123386
    [1033173.493678] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [1033173.493689] Modules linked in:
    [1033173.493697] CPU: 9 PID: 23815 Comm: trinity-c64 Not tainted 4.2.0-next-20150911-sasha-00043-g353d875-dirty #2545
    [1033173.493706] task: ffff8800630c4000 ti: ffff880063110000 task.ti: ffff880063110000
    [1033173.493823] RIP: clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
    [1033173.493826] RSP: 0018:ffff880063117a88 EFLAGS: 00010203
    [1033173.493828] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 000000000000000c
    [1033173.493830] RDX: 0000000000000002 RSI: ffffffffb3f10720 RDI: 0000000000000014
    [1033173.493832] RBP: ffff880063117b80 R08: ffff88047574d9a4 R09: 0000000000000000
    [1033173.493834] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000c622f53
    [1033173.493836] R13: ffff8800cb905500 R14: ffff8808d6da2000 R15: 00000000fffffdfd
    [1033173.493840] FS: 00007fa56b92d700(0000) GS:ffff880478000000(0000) knlGS:0000000000000000
    [1033173.493843] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [1033173.493845] CR2: 0000000000000000 CR3: 00000000630e8000 CR4: 00000000000006a0
    [1033173.493855] Stack:
    [1033173.493862] ffffffffb0b60444 000000000000eaea 0000000041b58ab3 ffffffffb3c3ce32
    [1033173.493867] ffffffffb0b6f3e0 ffffffffb0b60444 ffffffffb5ea2e50 1ffff1000c622f5e
    [1033173.493873] ffff8800630c4cd8 00000000000ee09a ffffffffb3ec4888 ffffffffb5ea2de8
    [1033173.493874] Call Trace:
    [1033173.494108] do_vcc_ioctl (net/atm/ioctl.c:170)
    [1033173.494113] vcc_ioctl (net/atm/ioctl.c:189)
    [1033173.494116] svc_ioctl (net/atm/svc.c:605)
    [1033173.494200] sock_do_ioctl (net/socket.c:874)
    [1033173.494204] sock_ioctl (net/socket.c:958)
    [1033173.494244] do_vfs_ioctl (fs/ioctl.c:43 fs/ioctl.c:607)
    [1033173.494290] SyS_ioctl (fs/ioctl.c:622 fs/ioctl.c:613)
    [1033173.494295] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
    [1033173.494362] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 50 09 00 00 49 8b 9e 60 06 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 14 48 89 fa 48 c1 ea 03 b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 14 09 00
    All code
    ========
    0: fa cli
    1: 48 c1 ea 03 shr $0x3,%rdx
    5: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
    9: 0f 85 50 09 00 00 jne 0x95f
    f: 49 8b 9e 60 06 00 00 mov 0x660(%r14),%rbx
    16: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
    1d: fc ff df
    20: 48 8d 7b 14 lea 0x14(%rbx),%rdi
    24: 48 89 fa mov %rdi,%rdx
    27: 48 c1 ea 03 shr $0x3,%rdx
    2b:* 0f b6 04 02 movzbl (%rdx,%rax,1),%eax

    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
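
    The ordering problem above (a second ioctl using state that only the first one sets up) is usually closed with a guard like the one below. This is a sketch under assumptions, not the net/atm/clip.c change: struct vcc, struct clip_state, vcc_mkip() and vcc_set_encap() are hypothetical, and -EINVAL is an arbitrary choice of error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct clip_state { int encap; };         /* hypothetical per-VCC descriptor */
struct vcc { struct clip_state *clip; };  /* NULL until the "mkip" step runs */

/* Stand-in for the setup step: attaches the descriptor to the VCC. */
static void vcc_mkip(struct vcc *v, struct clip_state *s)
{
    s->encap = 0;
    v->clip = s;
}

/* Stand-in for the encapsulation step: refuses to run before setup. */
static int vcc_set_encap(struct vcc *v, int mode)
{
    if (!v->clip)
        return -EINVAL;  /* descriptor missing: bail out, don't dereference */
    v->clip->encap = mode;
    return 0;
}
```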
     
  • David Woodhouse reports skb_under_panic when we try to push ethernet
    header to fragmented ipv6 skbs:

    skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000
    data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
    [..]
    ip6_finish_output2+0x196/0x4da

    David further debugged this:
    [..] offending fragments were arriving here with skb_headroom(skb)==10.
    Which is reasonable, being the Solos ADSL card's header of 8 bytes
    followed by 2 bytes of PPP frame type.

    The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
    in ip6_forward will only see reassembled skb.

    Therefore, headroom is overestimated by 8 bytes (we pulled fragment
    header) and we don't check the skbs in the frag_list either.

    We can't do these checks in netfilter defrag since outdev isn't known yet.

    Furthermore, existing tests in ip6_fragment did not consider the fragment
    or ipv6 header size when checking headroom of the fraglist skbs.

    While at it, also fix an skb leak on memory allocation failure --
    ip6_fragment must consume the skb.

    I tested this with an e1000 driver hacked to not allocate additional
    headroom (we end up in slowpath, since LL_RESERVED_SPACE is 16).

    If 2 bytes of headroom are allocated, fastpath is taken (14 byte
    ethernet header was pulled, so 16 byte headroom available in all
    fragments).

    Reported-by: David Woodhouse
    Diagnosed-by: David Woodhouse
    Signed-off-by: Florian Westphal
    Tested-by: David Woodhouse
    Signed-off-by: David S. Miller

    Florian Westphal
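
    The numbers in the report can be checked with a little arithmetic: 10 bytes of headroom minus a 14 byte ethernet header leaves the data pointer 4 bytes in front of head, which matches data:dec97ffc against head:dec98000. A minimal sketch of the per-fragment check, with hypothetical helper names (this is not the ip6_fragment code):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes missing before `need` bytes can be pushed in front of the data. */
static size_t headroom_shortfall(size_t headroom, size_t need)
{
    return need > headroom ? need - headroom : 0;
}

/* Nonzero only if every fragment has at least `need` bytes of headroom.
 * Checking the reassembled head alone is not enough. */
static int frags_have_headroom(const size_t *headroom, size_t n, size_t need)
{
    for (size_t i = 0; i < n; i++)
        if (headroom[i] < need)
            return 0;
    return 1;
}
```

    In the test setup described above, fragments with 16 bytes of headroom pass a check against a reserved space of 16 and can take the fast path; fragments with only 14 do not.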
     
  • Steffen reported that the recent change to add oif to dst lookups breaks
    the VTI use case. The problem is that with the oif set in the flow struct
    the comparison to the nh_oif is triggered. Fix by splitting the
    FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
    bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
    nh oif (FLOWI_FLAG_SKIP_NH_OIF).

    Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")

    Signed-off-by: David Ahern
    Acked-by: Steffen Klassert
    Signed-off-by: David S. Miller

    David Ahern
     
  • Static code analysis reveals the following bug:

    net/openvswitch/conntrack.c:281 ovs_ct_helper()
    warn: unsigned 'protoff' is never less than zero.

    This signedness bug breaks error handling for IPv6 extension headers when
    using conntrack helpers. Fix the error by using a local signed variable.

    Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action")
    Reported-by: Dan Carpenter
    Signed-off-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
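
    The class of bug reported above is easy to reproduce in isolation. A minimal sketch, assuming a hypothetical find_hdr() helper that returns a header offset or a negative value on error (this is not the ovs_ct_helper() code): storing the result directly in an unsigned variable turns -1 into a huge offset, so a "< 0" test can never fire, while a signed local keeps the error visible.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical parser: header offset on success, -1 on a parse error. */
static int find_hdr(int ok)
{
    return ok ? 40 : -1;
}

/* What an unsigned variable ends up holding if the error is stored in it. */
static unsigned int as_unsigned(int v)
{
    return (unsigned int)v;
}

/* Fixed pattern: check the signed result first, then convert. */
static int parse_fixed(int ok, unsigned int *protoff)
{
    int ofs = find_hdr(ok);  /* signed local, so ofs < 0 is meaningful */

    if (ofs < 0)
        return -1;
    *protoff = (unsigned int)ofs;
    return 0;
}
```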
     
  • Commit 718ba5b87343 moved the responsibility for unlocking the socket to
    xs_tcp_setup_socket, meaning that the socket will be unlocked before we
    know that it has finished trying to connect. The following patch is based on
    an initial patch by Russell King to ensure that we delay clearing the
    XPRT_CONNECTING flag until we either know that we failed to initiate
    a connection attempt, or the connection attempt itself failed.

    Fixes: 718ba5b87343 ("SUNRPC: Add helpers to prevent socket create from racing")
    Reported-by: Russell King
    Tested-by: Russell King
    Tested-by: Benjamin Coddington
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications.
    This makes nlm_flags in ipv6 replace notifications consistent
    with ipv4.

    Signed-off-by: Roopa Prabhu
    Acked-by: Nicolas Dichtel
    Reviewed-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Remove unneeded NULL test.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@ expression x; @@
    -if (x != NULL)
    \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Trond Myklebust

    Julia Lawall
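
    The effect of the semantic patch can be illustrated in user space. This sketch assumes a hypothetical NULL-tolerant pool_destroy(), in the same spirit as free(NULL); the commit message implies, but this sketch does not verify, that the three kernel destructors named in the patch behave this way.

```c
#include <assert.h>
#include <stdlib.h>

struct pool { int *slots; };  /* hypothetical resource */

static int destroyed;  /* counts real destructions, for illustration only */

static struct pool *pool_create(size_t n)
{
    struct pool *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    p->slots = calloc(n, sizeof(*p->slots));
    if (!p->slots) {
        free(p);
        return NULL;
    }
    return p;
}

/* NULL-tolerant destructor: callers need no NULL test of their own. */
static void pool_destroy(struct pool *p)
{
    if (!p)
        return;
    free(p->slots);
    free(p);
    destroyed++;
}

/* Before the patch: if (p != NULL) pool_destroy(p);
 * After the patch:  pool_destroy(p); */
```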
     
  • When we're destroying the socket transport, we need to ensure that
    we cancel any existing delayed connection attempts, and order them
    w.r.t. the call to xs_close().

    Reported-by: "Suzuki K. Poulose"
    Acked-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Trond Myklebust