16 Jan, 2015

1 commit

  • softnet_data.input_pkt_queue is protected by a spinlock that
    we must hold when transferring packets from victim queue to an active
    one. This is because other cpus could still be trying to enqueue packets
    into victim queue.

    A second problem is that when we transfer the NAPI poll_list from
    victim to current cpu, we absolutely need to special case the percpu
    backlog, because we do not want to add complex locking to protect
    process_queue : Only owner cpu is allowed to manipulate it, unless cpu
    is offline.

    Based on initial patch from Prasad Sodagudi & Subash Abhinov
    Kasiviswanathan.

    This version is better because we do not slow down packet processing,
    only make migration safer.

    Reported-by: Prasad Sodagudi
    Reported-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Jan, 2015

1 commit

  • When setting base_reachable_time or base_reachable_time_ms on a
    specific interface through sysctl or netlink, the reachable_time
    value is not updated.

    This means that neighbour entries will continue to be updated using the
    old value until it is recomputed in neigh_period_work (which
    recomputes the value every 300*HZ).
    On systems with HZ equal to 1000, for instance, this means 5 minutes
    before the change takes effect.

    This patch changes this behavior by recomputing reachable_time after
    each set on base_reachable_time or base_reachable_time_ms.
    The new value will become effective the next time the neighbour's timer
    is triggered.

    Changes are made in two places: the netlink code for set and the sysctl
    handling code. For sysctl, I use a proc_handler. The ipv6 network
    code does provide its own handler but it already refreshes
    reachable_time correctly so it's not an issue.
    Any other user of neighbour that provides its own handlers must
    refresh reachable_time.

    Signed-off-by: Jean-Francois Remy
    Signed-off-by: David S. Miller

    Jean-Francois Remy
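
The behaviour described in this commit can be illustrated with a small userspace model. This is only a sketch: `struct parms_model` and the three functions are hypothetical names, and the randomisation (uniform between 0.5 and 1.5 times the base value) is an assumption about how reachable_time is derived from base_reachable_time, not code taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-interface neighbour parameters. */
struct parms_model {
    unsigned long base_reachable_time; /* in ticks */
    unsigned long reachable_time;      /* randomised value actually used */
};

/* Seconds between periodic recomputations when the period is 300*hz ticks. */
unsigned long periodic_refresh_seconds(unsigned long hz)
{
    assert(hz > 0);
    return (300UL * hz) / hz;
}

/* Assumed randomisation: uniform in [base/2, base/2 + base). */
unsigned long rand_reach_time_model(unsigned long base)
{
    return base ? ((unsigned long)rand() % base) + (base >> 1) : 0;
}

/* What the commit describes: refresh reachable_time on every set,
 * instead of waiting for the periodic worker. */
void set_base_reachable_time(struct parms_model *p, unsigned long base)
{
    p->base_reachable_time = base;
    p->reachable_time = rand_reach_time_model(base);
}
```

With this shape, a caller that sets a new base value sees a matching reachable_time immediately rather than up to 300 seconds later.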
     

27 Dec, 2014

2 commits

  • GSO isn't the only offload feature with restrictions that
    potentially can't be expressed with the current features mechanism.
    Checksum is another although it's a general issue that could in
    theory apply to anything. Even if it may be possible to
    implement these restrictions in other ways, it can result in
    duplicate code or inefficient per-packet behavior.

    This generalizes ndo_gso_check so that drivers can remove any
    features that don't make sense for a given packet, similar to
    netif_skb_features(). It also converts existing driver
    restrictions to the new format, completing the work that was
    done to support tunnel protocols since the issues apply to
    checksums as well.

    By actually removing features from the set that are used to do
    offloading, it solves another problem with the existing
    interface. In these cases, GSO would run with the original set
    of features and not do anything because it appears that
    segmentation is not required.

    CC: Tom Herbert
    CC: Joe Stringer
    CC: Eric Dumazet
    CC: Hayes Wang
    Signed-off-by: Jesse Gross
    Acked-by: Tom Herbert
    Fixes: 04ffcb255f22 ("net: Add ndo_gso_check")
    Tested-by: Hayes Wang
    Signed-off-by: David S. Miller

    Jesse Gross
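
A minimal sketch of the idea in this commit, assuming a driver callback that receives the candidate feature set and returns it with unsupported bits cleared. All names here (`features_model_t`, the `F_MODEL_*` flags, `struct pkt_model`, `DRV_MAX_INNER_HDR`) are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t features_model_t;

/* Hypothetical feature bits for the sketch. */
#define F_MODEL_CSUM (1ULL << 0)
#define F_MODEL_GSO  (1ULL << 1)
#define F_MODEL_SG   (1ULL << 2)

/* Hypothetical per-packet facts a driver might inspect. */
struct pkt_model {
    int encapsulated;
    unsigned int inner_hdr_len;
};

/* Assumed hardware limit, chosen only for illustration. */
#define DRV_MAX_INNER_HDR 64u

/* Driver hook: strip the features that make no sense for this packet. */
features_model_t drv_features_check(const struct pkt_model *pkt,
                                    features_model_t features)
{
    assert(pkt != NULL);
    if (pkt->encapsulated && pkt->inner_hdr_len > DRV_MAX_INNER_HDR)
        features &= ~(F_MODEL_CSUM | F_MODEL_GSO);
    return features;
}

/* Core side: segmentation falls back to software once GSO was removed. */
int needs_software_gso(const struct pkt_model *pkt, features_model_t dev_features)
{
    return !(drv_features_check(pkt, dev_features) & F_MODEL_GSO);
}
```

Because the reduced set is what the offload decision sees, software segmentation runs against the features the device can really handle for that packet.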
     
  • When using VXLAN tunnels and a sky2 device, I have experienced
    checksum failures of the following type:

    [ 4297.761899] eth0: hw csum failure
    [...]
    [ 4297.765223] Call Trace:
    [ 4297.765224] [] dump_stack+0x46/0x58
    [ 4297.765235] [] netdev_rx_csum_fault+0x42/0x50
    [ 4297.765238] [] ? skb_push+0x40/0x40
    [ 4297.765240] [] __skb_checksum_complete+0xbc/0xd0
    [ 4297.765243] [] tcp_v4_rcv+0x2e2/0x950
    [ 4297.765246] [] ? ip_rcv_finish+0x360/0x360

    These are reliably reproduced in a network topology of:

    container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch

    When VXLAN encapsulated traffic is received from a similarly
    configured peer, the above warning is generated in the receive
    processing of the encapsulated packet. Note that the warning is
    associated with the container eth0.

    The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
    because the packet is an encapsulated Ethernet frame, the checksum
    generated by the hardware includes the inner protocol and Ethernet
    headers.

    The receive code is careful to update the skb->csum, except in
    __dev_forward_skb, as called by dev_forward_skb. __dev_forward_skb
    calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
    to skip over the Ethernet header, but does not update skb->csum when
    doing so.

    This patch resolves the problem by adding a call to
    skb_postpull_rcsum to update the skb->csum after the call to
    eth_type_trans.

    Signed-off-by: Jay Vosburgh
    Signed-off-by: David S. Miller

    Jay Vosburgh
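
The checksum adjustment described in this commit can be sketched in userspace. The helpers below are hypothetical models (names ending in `_model`), written with plain one's-complement arithmetic over big-endian 16-bit words. They are not the kernel's implementations, and the sketch assumes the pulled header has an even length, as the 14-byte Ethernet header does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of buf, accumulated into a 32-bit value. */
uint32_t csum_partial_model(const uint8_t *buf, size_t len, uint32_t sum)
{
    uint64_t acc = sum;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        acc += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        acc += (uint32_t)buf[i] << 8;
    while (acc >> 32)
        acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* Fold a 32-bit sum to 16 bits (not inverted). */
uint16_t fold16_model(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's-complement subtraction: add the complement, wrap the carry. */
uint32_t csum_sub_model(uint32_t csum, uint32_t sub)
{
    uint64_t acc = (uint64_t)csum + (uint32_t)~sub;

    acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}

/* After pulling len header bytes, remove their contribution from csum. */
uint32_t postpull_rcsum_model(uint32_t csum, const uint8_t *pulled, size_t len)
{
    assert(len % 2 == 0); /* simplification: keeps 16-bit word alignment */
    return csum_sub_model(csum, csum_partial_model(pulled, len, 0));
}
```

If the header bytes are skipped without this adjustment, the stored sum still covers them and no longer matches the data that later layers verify.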
     

24 Dec, 2014

8 commits

  • skb_scrub_packet() is called when a packet switches between contexts,
    such as between underlay and overlay, between namespaces, or between
    L3 subnets.

    While we already scrub the packet mark, connection tracking entry,
    and cached destination, the security mark/context is left intact.

    It seems wrong to inherit the security context of a packet when going
    from overlay to underlay or across forwarding paths.

    Signed-off-by: Thomas Graf
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • When vlan tags are stacked, it is very likely that the outer tag is stored
    in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
    Currently netif_skb_features() first looks at skb->protocol even if there
    is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
    encapsulated by the inner vlan instead of the inner vlan protocol.
    This allows GSO packets to be passed to HW and they end up being
    corrupted.

    Fixes: 58e998c6d239 ("offloading: Force software GSO for multiple vlan tags.")
    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • Fixes MPLS GSO for the case when MPLS is compiled as a kernel module.

    Fixes: 0d89d2035f ("MPLS: Add limited GSO support").
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • This patch rearranges the loop in net_rx_action to reduce the
    amount of jumping back and forth when reading the code.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • We should only perform the softnet_break check after we have polled
    at least one device in net_rx_action. Otherwise a zero or negative
    setting of netdev_budget can lock up the whole system.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
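
The ordering issue in this commit can be shown with a toy loop. Both functions are hypothetical models, not the kernel's net_rx_action; they only count how many devices get polled for a given budget.

```c
#include <assert.h>

/* Budget checked before polling: a budget <= 0 polls nothing at all. */
int check_then_poll(int budget, int ndev, int work_per_poll)
{
    int polled = 0;

    assert(work_per_poll > 0);
    while (polled < ndev) {
        if (budget <= 0)
            break;
        budget -= work_per_poll;
        polled++;
    }
    return polled;
}

/* Budget checked after polling: at least one device makes progress. */
int poll_then_check(int budget, int ndev, int work_per_poll)
{
    int polled = 0;

    assert(work_per_poll > 0);
    while (polled < ndev) {
        budget -= work_per_poll;
        polled++;
        if (budget <= 0)
            break;
    }
    return polled;
}
```

In the first variant a zero or negative budget means the pending work never shrinks, which is the kind of no-progress loop the commit message warns about.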
     
  • The commit d75b1ade567ffab085e8adbbdacf0092d10cd09c (net: less
    interrupt masking in NAPI) required drivers to leave poll_list
    empty if the entire budget is consumed.

    We have already had two broken drivers so let's add a check for
    this.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch creates a new function napi_poll and moves the napi
    polling code from net_rx_action into it.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Commit cecda693a969816bac5e470e1d9c9c0ef5567bca ("net: keep original skb
    which only needs header checking during software GSO") keeps the original
    skb for packets that only need a header check, but it doesn't drop the
    packet if software segmentation or the header check fails.

    Fixes: cecda693a9 ("net: keep original skb which only needs header checking during software GSO")
    Cc: Eric Dumazet
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

19 Dec, 2014

1 commit

  • Pull networking fixes from David Miller:

    1) Fix NBMA tunnel mac header handling in GRE, from Timo Teräs.

    2) Fix a NAPI race in the fec driver, from Nimrod Andy.

    3) The new IFF_VNET_LE bit is outside the size of the flags member it
    is stored in (which is 16-bits), store the state locally in the
    drivers. From Michael S Tsirkin.

    4) We are kicking the tires with the new wireless maintainership
    situation. Bluetooth fixes via Johan Hedberg, and mac80211 fixes
    from Johannes Berg.

    5) Fix locking and leaks in geneve driver, from Jesse Gross.

    6) Make netlink TX mmap code always copy, so we don't have to be
    potentially exposed to the user changing the underlying contents
    from underneath us.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
    be2net: Fix incorrect setting of tunnel offload flag in netdev features
    bnx2x: fix typos in "configure"
    xen-netback: support frontends without feature-rx-notify again
    MAINTAINERS: changes for wireless
    cxgb4: Fix decoding QSA module for ethtool get settings
    geneve: Fix races between socket add and release.
    geneve: Remove socket and offload handlers at destruction.
    netlink: Don't reorder loads/stores before marking mmap netlink frame as available
    netlink: Always copy on mmap TX.
    Bluetooth: Fix bug with filter in service discovery optimization
    mac80211: free management frame keys when removing station
    net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
    net/mlx4: Cache line CQE/EQE stride fixes
    net: fec: Fix NAPI race
    xen-netfront: use napi_complete() correctly to prevent Rx stalling
    ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup()
    ip_tunnel: Add sanity checks to ip_tunnel_encap_add_ops()
    net: Allow FIXED_PHY to be modular.
    if_tun: drop broken IFF_VNET_LE
    macvtap: drop broken IFF_VNET_LE
    ...

    Linus Torvalds
     

17 Dec, 2014

2 commits

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     
  • The current implementations all use dev_uc_add_excl() and similar helpers,
    whose API doesn't support vlans, so we can't support it in NIC HW for now.

    Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
    Signed-off-by: Or Gerlitz
    Reviewed-by: Jiri Pirko
    Acked-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Or Gerlitz
     

14 Dec, 2014

1 commit

  • Pull crypto update from Herbert Xu:
    - The crypto API is now documented :)
    - Disallow arbitrary module loading through crypto API.
    - Allow get request with empty driver name through crypto_user.
    - Allow speed testing of arbitrary hash functions.
    - Add caam support for ctr(aes), gcm(aes) and their derivatives.
    - nx now supports concurrent hashing properly.
    - Add sahara support for SHA1/256.
    - Add ARM64 version of CRC32.
    - Misc fixes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits)
    crypto: tcrypt - Allow speed testing of arbitrary hash functions
    crypto: af_alg - add user space interface for AEAD
    crypto: qat - fix problem with coalescing enable logic
    crypto: sahara - add support for SHA1/256
    crypto: sahara - replace tasklets with kthread
    crypto: sahara - add support for i.MX53
    crypto: sahara - fix spinlock initialization
    crypto: arm - replace memset by memzero_explicit
    crypto: powerpc - replace memset by memzero_explicit
    crypto: sha - replace memset by memzero_explicit
    crypto: sparc - replace memset by memzero_explicit
    crypto: algif_skcipher - initialize upon init request
    crypto: algif_skcipher - removed unneeded code
    crypto: algif_skcipher - Fixed blocking recvmsg
    crypto: drbg - use memzero_explicit() for clearing sensitive data
    crypto: drbg - use MODULE_ALIAS_CRYPTO
    crypto: include crypto- module prefix in template
    crypto: user - add MODULE_ALIAS
    crypto: sha-mb - remove a bogus NULL check
    crytpo: qat - Fix 64 bytes requests
    ...

    Linus Torvalds
     

12 Dec, 2014

1 commit

  • Pull networking updates from David Miller:

    1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

    2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers. Thanks to Al Viro
    and Herbert Xu.

    3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

    4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

    5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

    6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

    7) Add a variant of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

    8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

    9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets. From Alexei
    Starovoitov.

    10) Support TSO/LSO in sunvnet driver, from David L Stevens.

    11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

    12) Remote checksum offload, from Tom Herbert.

    13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

    14) Add MPLS support to openvswitch, from Simon Horman.

    15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

    16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet. This tries to resolve the conflicting goals between the
    desired handling of bulk vs. RPC-like traffic.

    17) Allow userspace to ask for the CPU on which a packet was
    received/steered, via SO_INCOMING_CPU. From Eric Dumazet.

    18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

    19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

    20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

    21) Add VLAN packet scheduler action, from Jiri Pirko.

    22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
    Fix race condition between vxlan_sock_add and vxlan_sock_release
    net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
    net/mlx4: Add support for A0 steering
    net/mlx4: Refactor QUERY_PORT
    net/mlx4_core: Add explicit error message when rule doesn't meet configuration
    net/mlx4: Add A0 hybrid steering
    net/mlx4: Add mlx4_bitmap zone allocator
    net/mlx4: Add a check if there are too many reserved QPs
    net/mlx4: Change QP allocation scheme
    net/mlx4_core: Use tasklet for user-space CQ completion events
    net/mlx4_core: Mask out host side virtualization features for guests
    net/mlx4_en: Set csum level for encapsulated packets
    be2net: Export tunnel offloads only when a VxLAN tunnel is created
    gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
    cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
    net: fec: only enable mdio interrupt before phy device link up
    net: fec: clear all interrupt events to support i.MX6SX
    net: fec: reset fep link status in suspend function
    net: sock: fix access via invalid file descriptor
    net: introduce helper macro for_each_cmsghdr
    ...

    Linus Torvalds
     

11 Dec, 2014

7 commits

  • 0day robot reported the following crash:
    [ 21.233581] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007
    [ 21.234709] IP: [] sk_attach_bpf+0x39/0xc2

    It's due to bpf_prog_get() returning ERR_PTR.
    Check it properly.

    Reported-by: Fengguang Wu
    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
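
The bug class fixed here is easy to reproduce in userspace. The sketch below re-creates the error-pointer convention with local helpers that mirror the kernel's names. `prog_get_model()` and `attach_model()` are hypothetical stand-ins for the functions named in the commit, and the choice of -EBADF is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace copies of the error-pointer helpers: a small negative errno
 * is encoded in the top 4095 values of the address space. */
static inline void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct prog_model { int refcnt; };
static struct prog_model the_prog_model;

/* Hypothetical lookup: fails with an error pointer, never with NULL. */
struct prog_model *prog_get_model(int fd)
{
    if (fd < 0)
        return ERR_PTR(-EBADF);
    the_prog_model.refcnt++;
    return &the_prog_model;
}

/* Correct caller: test with IS_ERR(), since a NULL check would pass. */
int attach_model(int fd)
{
    struct prog_model *prog = prog_get_model(fd);

    if (IS_ERR(prog))
        return (int)PTR_ERR(prog);
    return 0;
}
```

An error pointer is non-NULL, so a caller that only tests for NULL goes on to dereference an address near the top of the address space.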
     
  • Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating
    cmsghdr from msghdr, just cleanup.

    Signed-off-by: Gu Zheng
    Signed-off-by: David S. Miller

    Gu Zheng
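
A userspace sketch of such a wrapper, built on the standard CMSG_FIRSTHDR and CMSG_NXTHDR macros. The macro body is an assumption about what the helper looks like, and `count_cmsgs()` and `put_int_cmsg()` are hypothetical helpers added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed shape of the helper: iterate over every control message. */
#define for_each_cmsghdr(cmsg, msg) \
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg))

/* Count the control messages attached to msg. */
int count_cmsgs(struct msghdr *msg)
{
    struct cmsghdr *cmsg;
    int n = 0;

    assert(msg != NULL);
    for_each_cmsghdr(cmsg, msg)
        n++;
    return n;
}

/* Fill one control message carrying a single int payload. */
void put_int_cmsg(struct cmsghdr *c, int value)
{
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &value, sizeof(int));
}
```

The macro only hides the loop header, so converted call sites keep their bodies unchanged.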
     
  • Al Viro
     
  • Conflicts:
    drivers/net/ethernet/amd/xgbe/xgbe-desc.c
    drivers/net/ethernet/renesas/sh_eth.c

    Overlapping changes in both conflict cases.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This change pulls the core functionality out of __netdev_alloc_skb and
    places them in a new function named __alloc_rx_skb. The reason for doing
    this is to make these bits accessible to a new function __napi_alloc_skb.
    In addition __alloc_rx_skb now has a new flags value that is used to
    determine which page frag pool to allocate from. If the SKB_ALLOC_NAPI
    flag is set then the NAPI pool is used. The advantage of this is that we
    do not have to use local_irq_save/restore when accessing the NAPI pool from
    NAPI context.

    In my test setup I saw at least 11ns of savings using the napi_alloc_skb
    function versus the netdev_alloc_skb function, most of this being due to
    the fact that we didn't have to call local_irq_save/restore.

    The main use case for napi_alloc_skb would be for things such as copybreak
    or page fragment based receive paths where an skb is allocated after the
    data has been received instead of before.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch splits the netdev_alloc_frag function up so that it can be used
    on one of two page frag pools instead of being fixed on the
    netdev_alloc_cache. By doing this we can add a NAPI specific function
    __napi_alloc_frag that accesses a pool that is only used from softirq
    context. The advantage to this is that we do not need to call
    local_irq_save/restore which can be a significant savings.

    I also took the opportunity to refactor the core bits that were placed in
    __alloc_page_frag. First I updated the allocation to do either a 32K
    allocation or an order 0 page. This is based on the changes in commit
    d9b2938aa where it was found that latencies could be reduced in case of
    failures. Then I also rewrote the logic to work from the end of the page to
    the start. By doing this the size value doesn't have to be used unless we
    have run out of space for page fragments. Finally I cleaned up the atomic
    bits so that we just do an atomic_sub_and_test and if that returns true then
    we set the page->_count via an atomic_set. This way we can remove the extra
    conditional for the atomic_read since it would have led to an atomic_inc in
    the case of success anyway.

    Signed-off-by: Alexander Duyck
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexander Duyck
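
The "work from the end of the page to the start" idea can be sketched as a tiny bump allocator. `struct frag_pool_model` and `frag_alloc_model()` are hypothetical. In particular, the refill step here just rewinds the same buffer and counts the event, whereas a real allocator would switch to a fresh page and rely on reference counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool: offset counts down from size towards 0. */
struct frag_pool_model {
    unsigned char *base;
    size_t size;
    size_t offset;
    unsigned int refills;
};

void *frag_alloc_model(struct frag_pool_model *p, size_t fragsz)
{
    assert(p != NULL && p->base != NULL);
    if (p->offset < fragsz) {
        /* Out of room: only this slow path needs to look at size. */
        if (fragsz > p->size)
            return NULL;
        p->offset = p->size; /* simplification: rewind instead of new page */
        p->refills++;
    }
    p->offset -= fragsz;
    return p->base + p->offset;
}
```

The fast path is a single comparison against the remaining offset followed by a subtraction, which is the property the commit message points at.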
     
  • More iov_iter work for the networking from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Dec, 2014

8 commits

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

    This instruments might_sleep() checks to catch places that nest
    blocking primitives - such as mutex usage in a wait loop. Such
    bugs can result in hard to debug races/hangs.

    Another category of invalid nesting that this facility will detect
    is the calling of blocking functions from within schedule() ->
    sched_submit_work() -> blk_schedule_flush_plug().

    There's some potential for false positives (if secondary blocking
    primitives themselves are not ready yet for this facility), but the
    kernel will warn once about such bugs per bootup, so the warning
    isn't much of a nuisance.

    This feature comes with a number of fixes, for problems uncovered
    with it, so no messages are expected normally.

    - Another round of sched/numa optimizations and refinements, for
    CONFIG_NUMA_BALANCING=y.

    - Another round of sched/dl fixes and refinements.

    Plus various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched: Add missing rcu protection to wake_up_all_idle_cpus
    sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
    sched/numa: Init numa balancing fields of init_task
    sched/deadline: Remove unnecessary definitions in cpudeadline.h
    sched/cpupri: Remove unnecessary definitions in cpupri.h
    sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
    sched/fair: Fix stale overloaded status in the busiest group finding logic
    sched: Move p->nr_cpus_allowed check to select_task_rq()
    sched/completion: Document when to use wait_for_completion_io_*()
    sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
    sched/fair: Kill task_struct::numa_entry and numa_group::task_list
    sched: Refactor task_struct to use numa_faults instead of numa_* pointers
    sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
    sched/deadline: Reschedule from switched_from_dl() after a successful pull
    sched/deadline: Push task away if the deadline is equal to curr during wakeup
    sched/deadline: Add deadline rq status print
    sched/deadline: Fix artificial overrun introduced by yield_task_dl()
    sched/rt: Clean up check_preempt_equal_prio()
    sched/core: Use dl_bw_of() under rcu_read_lock_sched()
    sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
    ...

    Linus Torvalds
     
  • Remove use of 'swdev' mode in rocker. rocker dev offloads
    can use the BRIDGE_FLAGS_SELF to indicate offload to hardware.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: Scott Feldman
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • The queue length of sd->input_pkt_queue has already been read into qlen,
    and it cannot change because we hold the lock.

    Signed-off-by: Li RongQing
    Acked-by: Eric Dumazet
    Cc: Sergei Shtylyov
    Signed-off-by: David S. Miller

    Li RongQing
     
  • no callers other than itself.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... making both non-draining. That means that tcp_recvmsg() becomes
    non-draining. And _that_ would break iscsit_do_rx_data() unless we
    a) make sure tcp_recvmsg() is uniformly non-draining (it is)
    b) make sure it copes with arbitrary (including shifted)
    iov_iter (it does, all it uses is iov_iter primitives)
    c) make iscsit_do_rx_data() initialize ->msg_iter only once.

    Fortunately, (c) is doable with minimal work and we are rid of one of
    the two places where kernel send/recvmsg users would be unhappy with
    non-draining behaviour.

    Actually, that makes all but one of ->recvmsg() instances iov_iter-clean.
    The exception is skcipher_recvmsg() and it also isn't hard to convert
    to primitives (iov_iter_get_pages() is needed there). That'll wait
    a bit - there's some interplay with ->sendmsg() path for that one.

    Signed-off-by: Al Viro

    Al Viro
     
  • Since commit f8864972126899 ("ipv4: fix dst race in sk_dst_get()")
    DST_NOCACHE dst_entries get freed by RCU. So there is no need to get a
    reference on them when we are in rcu protected sections.

    Cc: Eric Dumazet
    Cc: Julian Anastasov
    Signed-off-by: Hannes Frederic Sowa
    Reviewed-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • Commit ce1a4ea3f125 ("net: avoid one atomic operation in skb_clone()")
    took the wrong way to save one atomic operation.

    It is actually possible to avoid two atomic operations, if we
    do not change skb->fclone values, and only rely on clone_ref
    content to signal if the clone is available or not.

    skb_clone() can simply use the fast clone if clone_ref is 1.

    kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.

    Note that because we usually free the clone before the original skb,
    this particular attempt is only done for the original skb to have better
    branch prediction.

    SKB_FCLONE_FREE is removed.

    Signed-off-by: Eric Dumazet
    Cc: Chris Mason
    Cc: Sabrina Dubroca
    Cc: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
    until after ndo_uninit") tried to do this earlier, but while doing so
    it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
    delayed the call to fill_info(). So this translated into asking the
    driver to remove its private state and then querying that private
    state. This could have catastrophic consequences.

    This change breaks rtmsg_ifinfo() into two parts: the first takes a
    precise snapshot of the device by calling fill_info() before calling
    ndo_uninit(), and the second sends the notification using the
    collected snapshot.

    The problem was noticed when the last link is deleted from an ipvlan
    device: the port has already been freed, and the subsequent
    .fill_info() call tries to get the info from the port.

    kernel: [ 255.139429] ------------[ cut here ]------------
    kernel: [ 255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
    kernel: [ 255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
    kernel: [ 255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
    kernel: [ 255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
    kernel: [ 255.139515] 0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
    kernel: [ 255.139516] 0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
    kernel: [ 255.139518] 00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
    kernel: [ 255.139520] Call Trace:
    kernel: [ 255.139527] [] dump_stack+0x46/0x58
    kernel: [ 255.139531] [] warn_slowpath_common+0x8c/0xc0
    kernel: [ 255.139540] [] warn_slowpath_null+0x1a/0x20
    kernel: [ 255.139544] [] rtmsg_ifinfo+0x100/0x110
    kernel: [ 255.139547] [] rollback_registered_many+0x1d5/0x2d0
    kernel: [ 255.139549] [] unregister_netdevice_many+0x1f/0xb0
    kernel: [ 255.139551] [] rtnl_dellink+0xbb/0x110
    kernel: [ 255.139553] [] rtnetlink_rcv_msg+0xa0/0x240
    kernel: [ 255.139557] [] ? rhashtable_lookup_compare+0x43/0x80
    kernel: [ 255.139558] [] ? __rtnl_unlock+0x20/0x20
    kernel: [ 255.139562] [] netlink_rcv_skb+0xb1/0xc0
    kernel: [ 255.139563] [] rtnetlink_rcv+0x25/0x40
    kernel: [ 255.139565] [] netlink_unicast+0x178/0x230
    kernel: [ 255.139567] [] netlink_sendmsg+0x30f/0x420
    kernel: [ 255.139571] [] sock_sendmsg+0x9c/0xd0
    kernel: [ 255.139575] [] ? rw_copy_check_uvector+0x6f/0x130
    kernel: [ 255.139577] [] ? copy_msghdr_from_user+0x139/0x1b0
    kernel: [ 255.139578] [] ___sys_sendmsg+0x304/0x310
    kernel: [ 255.139581] [] ? handle_mm_fault+0xca3/0xde0
    kernel: [ 255.139585] [] ? destroy_inode+0x3c/0x70
    kernel: [ 255.139589] [] ? __do_page_fault+0x20c/0x500
    kernel: [ 255.139597] [] ? dput+0xb6/0x190
    kernel: [ 255.139606] [] ? mntput+0x26/0x40
    kernel: [ 255.139611] [] ? __fput+0x174/0x1e0
    kernel: [ 255.139613] [] __sys_sendmsg+0x49/0x90
    kernel: [ 255.139615] [] SyS_sendmsg+0x12/0x20
    kernel: [ 255.139617] [] system_call_fastpath+0x12/0x17
    kernel: [ 255.139619] ---[ end trace 5e6703e87d984f6b ]---

    Signed-off-by: Mahesh Bandewar
    Reported-by: Toshiaki Makita
    Cc: Eric Dumazet
    Cc: Roopa Prabhu
    Cc: David S. Miller
    Acked-by: Eric Dumazet
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Mahesh Bandewar
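
The two-part split described in this commit can be modelled as "snapshot first, tear down, then notify". Everything below is a hypothetical sketch: the struct and function names are invented, and the returned int merely stands in for the contents of the notification.

```c
#include <assert.h>
#include <stddef.h>

struct port_model { int mode; };             /* driver private state */
struct dev_model { struct port_model *port; };
struct ifinfo_snapshot { int valid; int mode; };

/* Part one: read the private state while it still exists. */
struct ifinfo_snapshot ifinfo_build_model(const struct dev_model *dev)
{
    struct ifinfo_snapshot snap = {0, 0};

    if (dev->port) {
        snap.valid = 1;
        snap.mode = dev->port->mode;
    }
    return snap;
}

/* Driver teardown: private state is gone afterwards. */
void dev_uninit_model(struct dev_model *dev)
{
    dev->port = NULL;
}

/* Part two happens after teardown and only touches the snapshot. */
int unregister_model(struct dev_model *dev)
{
    struct ifinfo_snapshot snap = ifinfo_build_model(dev);

    dev_uninit_model(dev);
    assert(dev->port == NULL);
    return snap.valid ? snap.mode : -1;
}
```

Querying the device after teardown, as the earlier ordering did, would find no private state left to report.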
     

09 Dec, 2014

1 commit

  • This patch extends the set/get_rxfh ethtool-options for getting or
    setting the RSS hash function.

    It modifies the drivers' implementations of set/get_rxfh accordingly.

    This change also delegates the responsibility of checking whether a
    modification to a certain RX flow hash parameter is supported to the
    driver implementation of set_rxfh.

    User-kernel API is done through the new hfunc bitmask field in the
    ethtool_rxfh struct. A bit set in the hfunc field corresponds to an
    index in the new string-set ETH_SS_RSS_HASH_FUNCS.

    Got approval from most of the relevant driver maintainers that their
    driver is using Toeplitz, and for the few that didn't answer, we also
    assumed it is Toeplitz.

    Cc: Tom Lendacky
    Cc: Ariel Elior
    Cc: Prashant Sreedharan
    Cc: Michael Chan
    Cc: Hariprasad S
    Cc: Sathya Perla
    Cc: Subbu Seetharaman
    Cc: Ajit Khaparde
    Cc: Jeff Kirsher
    Cc: Jesse Brandeburg
    Cc: Bruce Allan
    Cc: Carolyn Wyborny
    Cc: Don Skidmore
    Cc: Greg Rose
    Cc: Matthew Vick
    Cc: John Ronciak
    Cc: Mitch Williams
    Cc: Amir Vadai
    Cc: Solarflare linux maintainers
    Cc: Shradha Shah
    Cc: Shreyas Bhatewara
    Cc: "VMware, Inc."
    Cc: Ben Hutchings
    Signed-off-by: Eyal Perry
    Signed-off-by: Amir Vadai
    Signed-off-by: David S. Miller

    Eyal Perry
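
The relationship between a bit in hfunc and an index in the string set can be sketched as follows. The name table is illustrative only (its contents and order are assumptions), and requiring exactly one set bit is a design choice of this sketch, not something stated in the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed contents of the string set, for illustration only. */
static const char *const rss_hash_func_names_model[] = { "toeplitz", "xor" };

#define RSS_HASH_FUNCS_COUNT_MODEL \
    (sizeof(rss_hash_func_names_model) / sizeof(rss_hash_func_names_model[0]))

/* Map a one-bit hfunc mask to its string-set index, or -1 if the mask
 * is empty, has several bits set, or names an unknown function. */
int hfunc_to_index(unsigned char hfunc)
{
    size_t i;

    if (hfunc == 0 || (hfunc & (hfunc - 1)) != 0)
        return -1;
    for (i = 0; i < RSS_HASH_FUNCS_COUNT_MODEL; i++)
        if (hfunc == (1u << i))
            return (int)i;
    return -1;
}

/* Look up the human-readable name for a one-bit hfunc mask. */
const char *hfunc_name(unsigned char hfunc)
{
    int idx = hfunc_to_index(hfunc);

    assert(idx < (int)RSS_HASH_FUNCS_COUNT_MODEL);
    return idx < 0 ? NULL : rss_hash_func_names_model[idx];
}
```

A userspace tool could fetch the real string set from the kernel and use the same bit-to-index rule to print or parse function names.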
     

06 Dec, 2014

1 commit

  • introduce new setsockopt() command:

    setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

    where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
    and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

    setsockopt() calls bpf_prog_get() which increments refcnt of the program,
    so it doesn't get unloaded while socket is using the program.

    The same eBPF program can be attached to multiple sockets.

    When the user task exits, the socket is closed automatically, which
    calls sk_filter_uncharge(), which decrements the refcnt of the eBPF
    program.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
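
The reference counting described in this commit can be followed with a toy model. This is not the kernel implementation: `struct prog_model`, `struct sock_model`, and the three functions are hypothetical, and the initial count of 1 is assumed to represent the reference held by the program's file descriptor.

```c
#include <assert.h>
#include <stddef.h>

struct prog_model { int refcnt; };
struct sock_model { struct prog_model *filter; };

/* Drop one reference; the program may be unloaded once this hits 0. */
void prog_put_model(struct prog_model *prog)
{
    assert(prog->refcnt > 0);
    prog->refcnt--;
}

/* Attach: take a reference so the program outlives its fd if needed. */
void sock_attach_model(struct sock_model *sk, struct prog_model *prog)
{
    struct prog_model *old = sk->filter;

    prog->refcnt++;
    sk->filter = prog;
    if (old)
        prog_put_model(old);
}

/* Closing the socket releases its reference on the program. */
void sock_close_model(struct sock_model *sk)
{
    if (sk->filter) {
        prog_put_model(sk->filter);
        sk->filter = NULL;
    }
}
```

In this model the same program attached to two sockets stays referenced until both sockets and the original descriptor are gone.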
     

05 Dec, 2014

6 commits