06 Apr, 2016

1 commit


05 Apr, 2016

3 commits


02 Apr, 2016

2 commits

  • Pull networking fixes from David Miller:

    1) Missing device reference in IPSEC input path results in crashes
    during device unregistration. From Subash Abhinov Kasiviswanathan.

    2) Per-queue ISR register writes not being done properly in macb
    driver, from Cyrille Pitchen.

    3) Stats accounting bugs in bcmgenet, from Petri Gynther.

    4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
    Quentin Armitage.

    5) SXGBE driver has off-by-one in probe error paths, from Rasmus
    Villemoes.

    6) Fix race in save/swap/delete options in netfilter ipset, from
    Vishwanath Pai.

    7) Ageing time of bridge not set properly when not operating over a
    switchdev device. Fix from Haishuang Yan.

    8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
    Duyck.

    9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

    10) FEC driver should only access registers that actually exist on the
    given chipset, fix from Fabio Estevam.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
    net: mvneta: fix changing MTU when using per-cpu processing
    stmmac: fix MDIO settings
    Revert "stmmac: Fix 'eth0: No PHY found' regression"
    stmmac: fix TX normal DESC
    net: mvneta: use cache_line_size() to get cacheline size
    net: mvpp2: use cache_line_size() to get cacheline size
    net: mvpp2: fix maybe-uninitialized warning
    tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
    net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
    rtnl: fix msg size calculation in if_nlmsg_size()
    fec: Do not access unexisting register in Coldfire
    net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
    net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
    bpf: make padding in bpf_tunnel_key explicit
    ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
    bnxt_en: Fix ethtool -a reporting.
    bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
    bnxt_en: Implement proper firmware message padding.
    ...

    Linus Torvalds
     
  • Sasha Levin reported a suspicious rcu_dereference_protected() warning
    found while fuzzing with trinity that is similar to this one:

    [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
    [ 52.765688] other info that might help us debug this:
    [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1
    [ 52.765701] 1 lock held by a.out/1525:
    [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
    [ 52.765721] stack backtrace:
    [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
    [...]
    [ 52.765768] Call Trace:
    [ 52.765775] [] dump_stack+0x85/0xc8
    [ 52.765784] [] lockdep_rcu_suspicious+0xd5/0x110
    [ 52.765792] [] sk_detach_filter+0x82/0x90
    [ 52.765801] [] tun_detach_filter+0x35/0x90 [tun]
    [ 52.765810] [] __tun_chr_ioctl+0x354/0x1130 [tun]
    [ 52.765818] [] ? selinux_file_ioctl+0x130/0x210
    [ 52.765827] [] tun_chr_ioctl+0x13/0x20 [tun]
    [ 52.765834] [] do_vfs_ioctl+0x96/0x690
    [ 52.765843] [] ? security_file_ioctl+0x43/0x60
    [ 52.765850] [] SyS_ioctl+0x79/0x90
    [ 52.765858] [] do_syscall_64+0x62/0x140
    [ 52.765866] [] entry_SYSCALL64_slow_path+0x25/0x25

    Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
    from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
    DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

    Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
    fixes"), sk_attach_filter()/sk_detach_filter() dereference the
    filter with rcu_dereference_protected(), checking whether the socket
    lock is held in the control path.

    Since its introduction in 994051625981 ("tun: socket filter support"),
    tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
    sock_owned_by_user(sk) check doesn't apply in this specific case and
    therefore triggers the false positive.

    Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
    that is used by tap filters and pass in lockdep_rtnl_is_held() for the
    rcu_dereference_protected() checks instead.

    Reported-by: Sasha Levin
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

01 Apr, 2016

1 commit


31 Mar, 2016

5 commits

  • Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
    and tunnel_label members explicit. No issue has been observed, and
    gcc/llvm does padding for the old struct already, where tunnel_label
    was not yet present, so the current code works, but since it's part
    of uapi, make sure we don't introduce holes in structs.

    Therefore, add tunnel_ext that we can use generically in future
    (f.e. to flag OAM messages for backends, etc). Also add the offset
    to the compat tests to be sure should some compilers not pad the
    tail of the old version of bpf_tunnel_key.

    Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
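
    The idea of replacing a compiler-inserted hole with a named member can
    be sketched as follows. This is a simplified stand-in, not the real
    uapi definition: the struct name and the single remote_ipv4 field are
    assumptions for illustration, while tunnel_ttl, tunnel_ext and
    tunnel_label follow the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct bpf_tunnel_key. */
struct tunnel_key_sketch {
	uint32_t tunnel_id;
	uint32_t remote_ipv4;
	uint8_t  tunnel_tos;
	uint8_t  tunnel_ttl;
	uint16_t tunnel_ext;	/* named 2-byte member instead of a hidden hole */
	uint32_t tunnel_label;
};

/* Compile-time checks: the member sizes add up, so no padding is hidden. */
static_assert(sizeof(struct tunnel_key_sketch) == 16,
	      "struct contains unexpected padding");
static_assert(offsetof(struct tunnel_key_sketch, tunnel_ext) == 10,
	      "tunnel_ext must directly follow tunnel_ttl");
static_assert(offsetof(struct tunnel_key_sketch, tunnel_label) == 12,
	      "tunnel_label must directly follow tunnel_ext");
```

    With the member spelled out, the layout no longer depends on how a
    particular compiler aligns tunnel_label.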
     
  • IPv6 counters updates use a different macro than IPv4.

    Fixes: 36cbb2452cbaf ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts")
    Signed-off-by: Eric Dumazet
    Cc: Rick Jones
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch should fix the issues seen with a recent fix to prevent
    tunnel-in-tunnel frames from being generated with GRO. The fix itself is
    correct for now as long as we do not add any devices that support
    NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the
    potential to mess things up due to the fact that the outer transport header
    points to the outer UDP header and not the GRE header as would be expected.

    Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • Somehow my patch for commit cea8768f333e ("sctp: allow
    sctp_transmit_packet and others to use gfp") missed two important
    chunks, which are now added.

    Fixes: cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp")
    Signed-off-by: Marcelo Ricardo Leitner
    Acked-By: Neil Horman
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     
  • When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP,
    we should ignore this error code and continue to set the ageing time.

    Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev")
    Signed-off-by: Haishuang Yan
    Acked-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Haishuang Yan
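
    A minimal userspace sketch of the error handling described above. The
    names bridge_sketch, offload_set_ageing_time() and
    bridge_set_ageing_time() are hypothetical; the stub only mimics what
    switchdev_port_attr_set() is said to return when NET_SWITCHDEV=n.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct bridge_sketch {
	unsigned long ageing_time;
};

/* Hypothetical stub: behaves like an offload hook that is compiled out. */
static int offload_set_ageing_time(unsigned long t)
{
	(void)t;
	return -EOPNOTSUPP;
}

static int bridge_set_ageing_time(struct bridge_sketch *br, unsigned long t)
{
	int err;

	assert(br != NULL);
	err = offload_set_ageing_time(t);
	/* "Not supported" only means there is no hardware to program,
	 * so the software value must still be updated. */
	if (err && err != -EOPNOTSUPP)
		return err;

	br->ageing_time = t;
	return 0;
}
```

    Any other error from the offload hook is still propagated to the
    caller unchanged.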
     

29 Mar, 2016

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for your net tree,
    they are:

    1) There was a race condition between parallel save/swap and delete,
    which resulted in a kernel crash due to the ref increase for save, the
    swap, and the wrong ref decrease operations. Reported and fixed by
    Vishwanath Pai.

    2) OVS should call into CT NAT for packets of new expected connections only
    when the conntrack state is persisted with the 'commit' option to the
    OVS CT action. From Jarno Rajahalme.

    3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.

    4) Early validation of entry->target_offset to make sure it doesn't take us
    out from the blob, from Florian Westphal.

    5) Again early validation of entry->next_offset to make sure it doesn't take
    us out from the blob, also from Florian.

    6) Check that entry->target_offset is always of sizeof(struct xt_entry)
    for unconditional entries, when checking both from check_underflow()
    and when checking for loops in mark_source_chains(), again from
    Florian.

    7) Fix inconsistent behaviour in nfnetlink_queue when
    NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
    overrun, we have to reinject the packet as the user expects.

    8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
    requests.

    9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
    this fixes a recent update of the code to namespaceify
    ip_default_ttl, patch from Liping Zhang.

    This batch comes with four patches to validate x_tables blobs coming
    from userspace. CONFIG_USERNS exposes the x_tables interface to
    unprivileged users and to be honest this interface never received the
    attention for this move away from the CAP_NET_ADMIN domain. Florian is
    working on another round with more patches with more sanity checks, so
    expect a bit more Netfilter fixes in this development cycle than usual.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Mar, 2016

11 commits

  • Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
    uses sock_net(skb->sk) to get the net namespace, but we can't assume
    that sk_buff->sk always exists, so when it is NULL, an oops will happen.

    Signed-off-by: Liping Zhang
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Make sure the table names via getsockopt GET_ENTRIES are nul-terminated
    in ebtables and all the x_tables variants and their respective compat
    code. Uncovered by KASAN.

    Reported-by: Baozeng Ding
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
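
    The termination step can be sketched in userspace as below. The struct
    get_entries_sketch, the 32-byte name length and table_name_len() are
    assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_NAME_LEN_SKETCH 32	/* assumed buffer size */

/* Hypothetical request header as copied in from user space. */
struct get_entries_sketch {
	char name[TABLE_NAME_LEN_SKETCH];
	unsigned int size;
};

/* Force termination before any string function touches the name. */
static size_t table_name_len(struct get_entries_sketch *req)
{
	assert(req != NULL);
	req->name[sizeof(req->name) - 1] = '\0';
	return strlen(req->name);
}
```

    Without the forced terminator, a name that fills the whole buffer
    would let strlen() and similar helpers read past the end of it.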
     
  • When netlink unicast fails to deliver the message to userspace, we
    should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
    the packet back to the stack.

    I think the user expects no packet drops when this flag is set due to
    queueing to userspace errors, no matter if related to the internal queue
    or when sending the netlink message to userspace.

    The userspace application will still get the ENOBUFS error via recvmsg()
    so the user still knows that, with the current configuration that is in
    place, the userspace application is not consuming the messages at the
    pace that the kernel needs.

    Reported-by: "Yigal Reiss (yreiss)"
    Signed-off-by: Pablo Neira Ayuso
    Tested-by: "Yigal Reiss (yreiss)"

    Pablo Neira Ayuso
     
  • Ben Hawkes says:

    In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
    is possible for a user-supplied ipt_entry structure to have a large
    next_offset field. This field is not bounds checked prior to writing a
    counter value at the supplied offset.

    Problem is that mark_source_chains should not have been called --
    the rule doesn't have a next entry, so it's supposed to return
    an absolute verdict of either ACCEPT or DROP.

    However, the function unconditional() doesn't work as the name implies.
    It only checks that the rule is using wildcard address matching.

    However, an unconditional rule must also not be using any matches
    (no -m args).

    The underflow validator only checked the addresses, therefore
    passing the 'unconditional absolute verdict' test, while
    mark_source_chains also tested for presence of matches, and thus
    proceeded to the next (non-existent) rule.

    Unify this so that all the callers have same idea of 'unconditional rule'.

    Reported-by: Ben Hawkes
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Otherwise this function may read data beyond the ruleset blob.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • We should check that e->target_offset is sane before
    mark_source_chains gets called since it will fetch the target entry
    for loop detection.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
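
    Early offset validation of this kind can be sketched as follows. The
    two structs are heavily reduced, hypothetical stand-ins for the
    x_tables entry and target headers, and entry_offsets_ok() is not a
    kernel function; only the field names target_offset and next_offset
    come from the commit messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced entry header. */
struct entry_sketch {
	uint16_t target_offset;	/* offset of the target, from entry start */
	uint16_t next_offset;	/* offset of the next entry, from entry start */
};

/* Hypothetical, reduced target header. */
struct target_sketch {
	uint16_t target_size;
};

/* Return 1 if the entry at byte offset 'off' stays inside the blob. */
static int entry_offsets_ok(const struct entry_sketch *e, size_t off,
			    size_t blob_size)
{
	assert(e != NULL);
	/* The entry header itself must fit. */
	if (off > blob_size || sizeof(*e) > blob_size - off)
		return 0;
	/* next_offset must not lead out of the blob. */
	if (e->next_offset > blob_size - off)
		return 0;
	/* The target must start after the entry header ... */
	if (e->target_offset < sizeof(*e))
		return 0;
	/* ... and its header must end before the next entry begins. */
	if ((size_t)e->target_offset + sizeof(struct target_sketch) >
	    e->next_offset)
		return 0;
	return 1;
}
```

    Running such checks before any code follows the offsets means later
    passes, such as loop detection, only ever see entries that point
    inside the blob.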
     
  • The openvswitch code has gained support for calling into the
    nf-nat-ipv4/ipv6 modules, however those can be loadable modules
    in a configuration in which openvswitch is built-in, leading
    to link errors:

    net/built-in.o: In function `__ovs_ct_lookup':
    :(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
    :(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

    The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
    but NF_NAT is set to 'y' if any of the symbols selecting
    it are built-in, but the link error happens when any of them
    are modular.

    A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
    CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
    to be useful in practice, but the driver currently only handles
    IPv6 being optional.

    This patch improves the Kconfig dependency so that openvswitch
    cannot be built-in if either of the two other symbols are set
    to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
    with two "if (IS_ENABLED())" checks that should catch all corner
    cases and also make the code more readable.

    The same #ifdef exists in ovs_ct_nat_to_attr(), where it does not
    cause a link error, but for consistency I'm changing it the same
    way.

    Signed-off-by: Arnd Bergmann
    Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
    Acked-by: Joe Stringer
    Signed-off-by: Pablo Neira Ayuso

    Arnd Bergmann
     
  • OVS should call into CT NAT for packets of new expected connections only
    when the conntrack state is persisted with the 'commit' option to the
    OVS CT action. The test for this condition is doubly wrong, as the CT
    status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
    than the mask (IPS_EXPECTED), and due to the wrong assumption that the
    expected bit would apply only for the first (i.e., 'new') packet of a
    connection, while in fact the expected bit remains on for the lifetime of
    an expected connection. The 'ctinfo' value IP_CT_RELATED derived from
    the ct status can be used instead, as it is only ever applicable to
    the 'new' packets of the expected connection.

    Fixes: 05752523e565 ('openvswitch: Interface with NAT.')
    Reported-by: Dan Carpenter
    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Pablo Neira Ayuso

    Jarno Rajahalme
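
    The first half of the bug, ANDing with the bit number instead of the
    mask, can be reproduced in a few lines. The two constants mirror the
    conntrack status definitions (bit number 0, mask 1 << 0); the helper
    names are hypothetical. Note that the actual fix goes further and
    tests the ctinfo value instead, which this sketch does not model.

```c
#include <assert.h>

enum {
	IPS_EXPECTED_BIT = 0,			/* bit number */
	IPS_EXPECTED = 1 << IPS_EXPECTED_BIT,	/* bit mask */
};

static_assert(IPS_EXPECTED == 1, "mask for bit 0 must be 1");

/* Buggy test: ANDing with bit number 0 can never be non-zero. */
static int expected_buggy(unsigned long status)
{
	return (status & IPS_EXPECTED_BIT) != 0;
}

/* Mask-based test: true whenever the expected bit is set. */
static int expected_with_mask(unsigned long status)
{
	return (status & IPS_EXPECTED) != 0;
}
```

    Because the bit number is 0, the buggy variant reports "not expected"
    for every possible status value.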
     
  • This fix adds a new reference counter (ref_netlink) for the struct ip_set.
    The other reference counter (ref) can be swapped out by ip_set_swap and we
    need a separate counter to keep track of references for netlink events
    like dump. Using the same ref counter for dump causes a race condition
    which can be demonstrated by the following script:

    ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
    counters
    ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
    counters
    ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
    counters

    ipset save &

    ipset swap hash_ip3 hash_ip2
    ipset destroy hash_ip3 /* will crash the machine */

    Swap will exchange the values of ref so destroy will see ref = 0 instead of
    ref = 1. With this fix in place swap will not succeed because ipset save
    still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

    Both delete and swap will error out if ref_netlink != 0 on the set.

    Note: The changes to the *_head functions are needed because previously
    we would increment ref whenever we called these functions; we don't do
    that anymore.

    Reviewed-by: Joshua Hunt
    Signed-off-by: Vishwanath Pai
    Signed-off-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Vishwanath Pai
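
    The two-counter scheme can be sketched without any locking as shown
    below. The struct and function names are hypothetical, and returning
    -EBUSY is an assumption made for this sketch; the commit message only
    states that swap and delete error out while ref_netlink is non-zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct set_sketch {
	unsigned int ref;		/* ordinary references, exchanged by swap */
	unsigned int ref_netlink;	/* held by in-flight dumps, never swapped */
};

static int set_swap(struct set_sketch *a, struct set_sketch *b)
{
	unsigned int tmp;

	assert(a != NULL && b != NULL);
	/* A dump in progress pins both the set and its identity. */
	if (a->ref_netlink || b->ref_netlink)
		return -EBUSY;

	tmp = a->ref;
	a->ref = b->ref;
	b->ref = tmp;
	return 0;
}

static int set_destroy(const struct set_sketch *s)
{
	assert(s != NULL);
	if (s->ref || s->ref_netlink)
		return -EBUSY;
	return 0;	/* a real implementation would free the set here */
}
```

    In the script above, "ipset save" would correspond to holding
    ref_netlink on hash_ip3, so both the swap and the destroy are refused
    until the dump has finished.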
     
  • For the input parameter count, it's better to use the size of the
    destination buffer, as nla_memcpy already takes into account the
    length of the source netlink attribute when data is copied from
    an attribute.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
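
    Why the destination size is the right value for count can be seen
    from a reduced model of the copy. attr_sketch and attr_memcpy() are
    hypothetical stand-ins that only model the "copy at most the shorter
    of the two lengths" behaviour described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute: payload length plus payload pointer. */
struct attr_sketch {
	size_t len;
	const void *data;
};

/* Copy at most 'count' bytes, and never more than the attribute holds. */
static size_t attr_memcpy(void *dest, const struct attr_sketch *src,
			  size_t count)
{
	size_t n;

	assert(dest != NULL && src != NULL);
	n = count < src->len ? count : src->len;
	memcpy(dest, src->data, n);
	return n;
}
```

    Passing sizeof(destination) as count therefore protects the
    destination, while the source length is already respected by the
    helper itself.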
     
  • For a route with IPv6 encapsulation, the traffic class and hop limit
    values are interchanged when returned to userspace by the kernel.
    For example, see below.

    ># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000
    ># ip route show table 1000
    192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link

    Signed-off-by: Quentin Armitage
    Signed-off-by: David S. Miller

    Quentin Armitage
     

27 Mar, 2016

1 commit

  • Pull Ceph updates from Sage Weil:
    "There is quite a bit here, including some overdue refactoring and
    cleanup on the mon_client and osd_client code from Ilya, scattered
    writeback support for CephFS and a pile of bug fixes from Zheng, and a
    few random cleanups and fixes from others"

    [ I already decided not to pull this because of it having been rebased
    recently, but ended up changing my mind after all. Next time I'll
    really hold people to it. Oh well. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
    libceph: use KMEM_CACHE macro
    ceph: use kmem_cache_zalloc
    rbd: use KMEM_CACHE macro
    ceph: use lookup request to revalidate dentry
    ceph: kill ceph_get_dentry_parent_inode()
    ceph: fix security xattr deadlock
    ceph: don't request vxattrs from MDS
    ceph: fix mounting same fs multiple times
    ceph: remove unnecessary NULL check
    ceph: avoid updating directory inode's i_size accidentally
    ceph: fix race during filling readdir cache
    libceph: use sizeof_footer() more
    ceph: kill ceph_empty_snapc
    ceph: fix a wrong comparison
    ceph: replace CURRENT_TIME by current_fs_time()
    ceph: scattered page writeback
    libceph: add helper that duplicates last extent operation
    libceph: enable large, variable-sized OSD requests
    libceph: osdc->req_mempool should be backed by a slab pool
    libceph: make r_request msg_size calculation clearer
    ...

    Linus Torvalds
     

26 Mar, 2016

15 commits

  • Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

    Signed-off-by: Geliang Tang
    Signed-off-by: Ilya Dryomov

    Geliang Tang
     
  • Don't open-code sizeof_footer() in read_partial_message() and
    ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now
    possible to use con_out_kvec_add() in prepare_write_message_footer().

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Alex Elder

    Ilya Dryomov
     
  • This helper duplicates last extent operation in OSD request, then
    adjusts the new extent operation's offset and length. The helper
    is for scattered page writeback, which adds nonconsecutive dirty
    pages to single OSD request.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Turn r_ops into a flexible array member to enable large, consisting of
    up to 16 ops, OSD requests. The use case is scattered writeback in
    cephfs and, as far as the kernel client is concerned, 16 is just a made
    up number.

    r_ops had size 3 for copyup+hint+write, but copyup is really a special
    case - it can only happen once. ceph_osd_request_cache is therefore
    stuffed with num_ops=2 requests, anything bigger than that is allocated
    with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
    means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
    users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
    that.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
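
    A userspace sketch of a request with a flexible array member. All
    names here are hypothetical, calloc() stands in for the slab and
    kmalloc() allocations mentioned above, and the limit of 16 is taken
    from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_OPS_SKETCH 16	/* upper bound named in the commit message */

struct op_sketch {
	int opcode;
};

struct request_sketch {
	unsigned int num_ops;
	struct op_sketch ops[];	/* flexible array member, must come last */
};

/* Allocate the header and num_ops trailing ops in one zeroed block. */
static struct request_sketch *request_alloc(unsigned int num_ops)
{
	struct request_sketch *req;

	if (num_ops == 0 || num_ops > MAX_OPS_SKETCH)
		return NULL;

	req = calloc(1, sizeof(*req) + num_ops * sizeof(req->ops[0]));
	if (req)
		req->num_ops = num_ops;
	return req;
}

/* Bounds-checked accessor for a single op. */
static struct op_sketch *request_op(struct request_sketch *req,
				    unsigned int i)
{
	assert(req != NULL && i < req->num_ops);
	return &req->ops[i];
}
```

    The caller releases the whole request, ops included, with a single
    free().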
     
  • ceph_osd_request_cache was introduced a long time ago. Also, osd_req
    is about to get a flexible array member, which ceph_osd_request_cache
    is going to be aware of.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Although msg_size is calculated correctly, the terms are grouped in
    a misleading way - snaps appears to not have room for a u32 length.
    Move calculation closer to its use and regroup terms.

    No functional change.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • This avoids defining a large array of r_reply_op_{len,result} in
    struct ceph_osd_request.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Follow userspace nomenclature on this - the next commit adds
    outdata_len.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • This can happen if __close_session() in ceph_monc_stop() races with
    a connection reset. We need to ignore such faults, otherwise it's
    likely we would take !hunting, call __schedule_delayed() and end up
    with delayed_work() executing on invalid memory, among other things.

    The (two!) con->private tests are useless, as nothing ever clears
    con->private. Nuke them.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Doing __schedule_delayed() in the hunting branch is pointless, as the
    tick will have already been scheduled by then.

    What we need to do instead is *reschedule* it in the !hunting branch,
    after reopen_session() changes hunt_mult, which affects the delay.
    This helps with spacing out connection attempts and avoiding things
    like two back-to-back attempts followed by a longer period of waiting
    around.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • hunting is now set in __open_session() and cleared in finish_hunting(),
    instead of all around. The "session lost" message is printed not only
    on connection resets, but also on keepalive timeouts.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Unless we are in the process of setting up a client (i.e. connecting to
    the monitor cluster for the first time), apply a backoff: every time we
    want to reopen a session, increase our timeout by a multiple (currently
    2); when we complete the connection, reduce that multiplier by 50%.

    Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
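
    The backoff arithmetic can be sketched as below. hunt_mult,
    reopen_session() and finish_hunting() are names used elsewhere in
    this series, but the bodies, monc_sketch, hunt_delay() and the lower
    bound of 1 are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define HUNT_BACKOFF_SKETCH 2.0	/* "a multiple (currently 2)" */

struct monc_sketch {
	double hunt_mult;	/* multiplier applied to the base timeout */
};

/* Every reopen attempt doubles the multiplier. */
static void reopen_session(struct monc_sketch *m)
{
	assert(m != NULL);
	m->hunt_mult *= HUNT_BACKOFF_SKETCH;
}

/* A completed connection halves it again, never going below 1 (assumed). */
static void finish_hunting(struct monc_sketch *m)
{
	assert(m != NULL);
	m->hunt_mult /= 2.0;
	if (m->hunt_mult < 1.0)
		m->hunt_mult = 1.0;
}

/* Effective timeout for the next attempt. */
static double hunt_delay(const struct monc_sketch *m, double base)
{
	assert(m != NULL);
	return base * m->hunt_mult;
}
```

    Two failed attempts in a row thus quadruple the timeout, and two
    successful connections bring it back to the base value.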
     
  • Split ping interval and ping timeout: ping interval is 10s; keepalive
    timeout is 30s.

    Make monc_ping_timeout a constant while at it - it's not actually
    exported as a mount option (and the rest of tick-related settings won't
    be either), so it's got no place in ceph_options.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Don't try to reconnect to the same monitor when we fail to establish
    a session within a timeout or it's lost.

    For that, pick_new_mon() needs to see the old value of cur_mon, so
    don't clear it in __close_session() - all calls to __close_session()
    but one are followed by __open_session() anyway. __open_session() is
    only called when a new session needs to be established, so the "already
    open?" branch, which is now in the way, is simply dropped.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
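
    One way to pick a different monitor is sketched below. The function
    name follows the commit message, but the signature, the use of
    rand() and the handling of the single-monitor case are assumptions,
    not the kernel's implementation.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return a monitor index in [0, num_mon) that differs from cur_mon
 * whenever an alternative exists; cur_mon < 0 means "none yet".
 */
static int pick_new_mon(int cur_mon, int num_mon)
{
	int n;

	if (num_mon <= 0)
		return -1;
	if (num_mon == 1)
		return 0;	/* no alternative to choose from */
	if (cur_mon < 0 || cur_mon >= num_mon)
		return rand() % num_mon;	/* nothing to avoid */

	/* Choose among the other num_mon - 1 monitors, skipping cur_mon. */
	n = rand() % (num_mon - 1);
	if (n >= cur_mon)
		n++;
	assert(n >= 0 && n < num_mon && n != cur_mon);
	return n;
}
```

    This only works if the caller still knows the old cur_mon at the
    time of the call, which is why it must not be cleared beforehand.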
     
  • It is currently hard-coded in the mon_client that mdsmap and monmap
    subs are continuous, while osdmap sub is always "onetime". To better
    handle full clusters/pools in the osd_client, we need to be able to
    issue continuous osdmap subs. Revamp subs code to allow us to specify
    for each sub whether it should be continuous or not.

    Although not strictly required for the above, switch to SUBSCRIBE2
    protocol while at it, eliminating the ambiguity between a request for
    "every map since X" and a request for "just the latest" when we don't
    have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now
    required - it's been supported since pre-argonaut (2010).

    Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
    it before we validate the epoch and successfully install the new map
    can mess up mon_client sub state.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov