19 Apr, 2014

1 commit

  • Currently, it is possible to create an SCTP socket, then switch
    auth_enable via sysctl setting to 1 and crash the system on connect:

    Oops[#1]:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
    task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
    [...]
    Call Trace:
    [] sctp_auth_asoc_set_default_hmac+0x68/0x80
    [] sctp_process_init+0x5e0/0x8a4
    [] sctp_sf_do_5_1B_init+0x234/0x34c
    [] sctp_do_sm+0xb4/0x1e8
    [] sctp_endpoint_bh_rcv+0x1c4/0x214
    [] sctp_rcv+0x588/0x630
    [] sctp6_rcv+0x10/0x24
    [] ip6_input+0x2c0/0x440
    [] __netif_receive_skb_core+0x4a8/0x564
    [] process_backlog+0xb4/0x18c
    [] net_rx_action+0x12c/0x210
    [] __do_softirq+0x17c/0x2ac
    [] irq_exit+0x54/0xb0
    [] ret_from_irq+0x0/0x4
    [] rm7k_wait_irqoff+0x24/0x48
    [] cpu_startup_entry+0xc0/0x148
    [] start_kernel+0x37c/0x398
    Code: dd0900b8 000330f8 0126302d 50c0fff1 0047182a a48306a0
    03e00008 00000000
    ---[ end trace b530b0551467f2fd ]---
    Kernel panic - not syncing: Fatal exception in interrupt

    What happens in that case is that, while auth_enable=0,
    ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
    when the endpoint is created.

    After that point, if an admin switches over to auth_enable=1,
    the machine can crash due to NULL pointer dereference during
    reception of an INIT chunk. When we enter sctp_process_init()
    via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
    the INIT verification succeeds and while we walk and process
    all INIT params via sctp_process_param() we find that
    net->sctp.auth_enable is set, therefore do not fall through,
    but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
    dereference what we have set to NULL during endpoint
    initialization phase.

    The fix is to make auth_enable immutable by caching its value
    during endpoint initialization, so that its original value is
    carried along until destruction. The bug seems to date back to
    the very first days.

    Fix in joint work with Daniel Borkmann.

    Reported-by: Joshua Kinard
    Signed-off-by: Vlad Yasevich
    Signed-off-by: Daniel Borkmann
    Acked-by: Neil Horman
    Tested-by: Joshua Kinard
    Signed-off-by: David S. Miller

    Vlad Yasevich
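    The race between endpoint creation and the sysctl flip can be modelled
    in a few lines. This is a hedged sketch, not the SCTP code:
    auth_enable_sysctl, struct endpoint_model and the helpers are
    hypothetical stand-ins for net->sctp.auth_enable, the endpoint,
    sctp_auth_init_hmacs() and sctp_auth_asoc_set_default_hmac(). It shows
    why consulting a per-endpoint cached flag, rather than the live sysctl,
    keeps the INIT path from touching a NULL HMAC list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-netns sysctl net->sctp.auth_enable. */
static bool auth_enable_sysctl;

/* Hypothetical, heavily reduced endpoint. */
struct endpoint_model {
    bool auth_enable;  /* sysctl value cached at creation time (the fix) */
    int *auth_hmacs;   /* only allocated when auth was on at creation */
};

/* Models sctp_auth_init_hmacs(): auth_hmacs stays NULL when auth is off. */
static struct endpoint_model *endpoint_create(void) {
    struct endpoint_model *ep = calloc(1, sizeof(*ep));
    if (!ep)
        return NULL;
    ep->auth_enable = auth_enable_sysctl;
    if (ep->auth_enable) {
        ep->auth_hmacs = calloc(4, sizeof(int));
        if (!ep->auth_hmacs) {
            free(ep);
            return NULL;
        }
        ep->auth_hmacs[0] = 1; /* pretend default HMAC id */
    }
    return ep;
}

/* Models the INIT-processing path. Testing the live sysctl here would
 * dereference a NULL auth_hmacs if the admin enabled auth after the
 * endpoint was created; testing the cached flag cannot.
 * Returns the default HMAC id, or 0 when auth is not in use. */
static int process_init_default_hmac(const struct endpoint_model *ep) {
    if (!ep->auth_enable)
        return 0;
    return ep->auth_hmacs[0];
}

static void endpoint_destroy(struct endpoint_model *ep) {
    if (ep) {
        free(ep->auth_hmacs);
        free(ep);
    }
}
```

    The cached flag travels with the endpoint until destruction, so a later
    sysctl change only affects endpoints created afterwards.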
     

17 Apr, 2014

1 commit

  • As suggested by Julian:

    Simply, flowi4_iif must not contain 0, it does not
    look logical to ignore all ip rules with specified iif.

    because in fib_rule_match() we do:

    if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
            goto out;

    flowi4_iif should be LOOPBACK_IFINDEX by default.

    We need to move LOOPBACK_IFINDEX to include/net/flow.h:

    1) It is mostly used by flowi_iif

    2) It fixes the following compile error when later patches
    use it in flow.h:

    In file included from include/linux/netfilter.h:277:0,
    from include/net/netns/netfilter.h:5,
    from include/net/net_namespace.h:21,
    from include/linux/netdevice.h:43,
    from include/linux/icmpv6.h:12,
    from include/linux/ipv6.h:61,
    from include/net/ipv6.h:16,
    from include/linux/sunrpc/clnt.h:27,
    from include/linux/nfs_fs.h:30,
    from init/do_mounts.c:32:
    include/net/flow.h: In function ‘flowi4_init_output’:
    include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)

    Cc: Eric Biederman
    Cc: Julian Anastasov
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
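    The effect of the fib_rule_match() check quoted above can be shown with
    a small model. This is a sketch: struct rule_model and struct flow_model
    are hypothetical reductions of the kernel's rule and flow keys, and
    LOOPBACK_IFINDEX is defined here as 1, its value in include/net/flow.h.
    A flow whose input interface is 0 fails every rule that names an iif,
    while a flow defaulting to the loopback ifindex can match such rules.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of LOOPBACK_IFINDEX in the kernel's include/net/flow.h. */
#define LOOPBACK_IFINDEX 1

/* Hypothetical reduced rule and flow; 0 in iifindex means "any iif". */
struct rule_model { int iifindex; };
struct flow_model { int flowi_iif; };

/* Mirrors the quoted check: a rule with an iif only matches flows whose
 * flowi_iif equals it; rules without an iif match everything. */
static bool rule_iif_matches(const struct rule_model *rule,
                             const struct flow_model *fl) {
    if (rule->iifindex && rule->iifindex != fl->flowi_iif)
        return false;
    return true;
}
```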
     

16 Apr, 2014

2 commits

  • In the dst->output() path for ipv4, the code assumes the skb it has to
    transmit is attached to an inet socket, specifically via
    ip_mc_output(): the sk_mc_loop() test triggers a WARN_ON() when the
    provider of the packet is an AF_PACKET socket.

    The dst->output() method gets an additional 'struct sock *sk'
    parameter. This needs a cascade of changes so that this parameter can
    be propagated from vxlan to the final consumer.

    Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
    Reported-by: lucien xin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ip_queue_xmit() assumes the skb it has to transmit is attached to an
    inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
    changed l2tp to not change skb ownership and thus broke this assumption.

    One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
    so that we do not assume skb->sk points to the socket used by l2tp
    tunnel.

    Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
    Reported-by: Zhan Jianyu
    Tested-by: Zhan Jianyu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
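    The difference between trusting skb->sk and passing the transmitting
    socket explicitly can be sketched as follows. All types and functions
    here are hypothetical reductions, not the kernel's struct sock,
    struct sk_buff or ip_queue_xmit(); the -1 return merely stands in for
    the crash or warning a non-inet owner would trigger.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced socket and packet. */
struct sock_model { int is_inet; int tos; };
struct skb_model { struct sock_model *sk; int tos; };

/* Old style: assumes the skb owner is the inet socket doing the send. */
static int queue_xmit_old(struct skb_model *skb) {
    if (!skb->sk || !skb->sk->is_inet)
        return -1; /* stands in for a bad dereference or WARN_ON() */
    skb->tos = skb->sk->tos;
    return 0;
}

/* New style: the caller (e.g. the tunnel) passes the socket it sends on,
 * so the original skb owner can be left untouched. */
static int queue_xmit_new(struct sock_model *sk, struct skb_model *skb) {
    if (!sk || !sk->is_inet)
        return -1;
    skb->tos = sk->tos;
    return 0;
}
```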
     

15 Apr, 2014

3 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains three Netfilter fixes for your net tree,
    they are:

    * Fix missing generation sequence initialization, which results in a
    splat if lockdep is enabled. It was introduced in the recent work to
    improve nf_conntrack scalability, from Andrey Vagin.

    * Don't flush the GRE keymap list in nf_conntrack when the pptp helper is
    disabled; otherwise this crashes due to a double release, from Andrey
    Vagin.

    * Fix nf_tables cmp fast on big endian, from Patrick McHardy.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
    to reflect real state of the receiver's buffer") as it introduced a
    serious performance regression on SCTP over IPv4 and IPv6, though not
    as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

    Current state:

    [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
    iperf version 3.0.1 (10 January 2014)
    Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
    Time: Fri, 11 Apr 2014 17:56:21 GMT
    Connecting to host 192.168.241.3, port 5201
    Cookie: Lab200slot2.1397238981.812898.548918
    [ 4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
    Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.09 sec 20.8 MBytes 161 Mbits/sec
    [ 4] 1.09-2.13 sec 10.8 MBytes 86.8 Mbits/sec
    [ 4] 2.13-3.15 sec 3.57 MBytes 29.5 Mbits/sec
    [ 4] 3.15-4.16 sec 4.33 MBytes 35.7 Mbits/sec
    [ 4] 4.16-6.21 sec 10.4 MBytes 42.7 Mbits/sec
    [ 4] 6.21-6.21 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 6.21-7.35 sec 34.6 MBytes 253 Mbits/sec
    [ 4] 7.35-11.45 sec 22.0 MBytes 45.0 Mbits/sec
    [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 11.45-12.51 sec 16.0 MBytes 126 Mbits/sec
    [ 4] 12.51-13.59 sec 20.3 MBytes 158 Mbits/sec
    [ 4] 13.59-14.65 sec 13.4 MBytes 107 Mbits/sec
    [ 4] 14.65-16.79 sec 33.3 MBytes 130 Mbits/sec
    [ 4] 16.79-16.79 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 16.79-17.82 sec 5.94 MBytes 48.7 Mbits/sec
    (etc)

    [root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
    iperf version 3.0.1 (10 January 2014)
    Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
    Time: Fri, 11 Apr 2014 19:08:41 GMT
    Connecting to host 2001:db8:0:f101::1, port 5201
    Cookie: Lab200slot2.1397243321.714295.2b3f7c
    [ 4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
    Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.00 sec 169 MBytes 1.42 Gbits/sec
    [ 4] 1.00-2.00 sec 201 MBytes 1.69 Gbits/sec
    [ 4] 2.00-3.00 sec 188 MBytes 1.58 Gbits/sec
    [ 4] 3.00-4.00 sec 174 MBytes 1.46 Gbits/sec
    [ 4] 4.00-5.00 sec 165 MBytes 1.39 Gbits/sec
    [ 4] 5.00-6.00 sec 199 MBytes 1.67 Gbits/sec
    [ 4] 6.00-7.00 sec 163 MBytes 1.36 Gbits/sec
    [ 4] 7.00-8.00 sec 174 MBytes 1.46 Gbits/sec
    [ 4] 8.00-9.00 sec 193 MBytes 1.62 Gbits/sec
    [ 4] 9.00-10.00 sec 196 MBytes 1.65 Gbits/sec
    [ 4] 10.00-11.00 sec 157 MBytes 1.31 Gbits/sec
    [ 4] 11.00-12.00 sec 175 MBytes 1.47 Gbits/sec
    [ 4] 12.00-13.00 sec 192 MBytes 1.61 Gbits/sec
    [ 4] 13.00-14.00 sec 199 MBytes 1.67 Gbits/sec
    (etc)

    After patch:

    [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
    iperf version 3.0.1 (10 January 2014)
    Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
    Time: Mon, 14 Apr 2014 16:40:48 GMT
    Connecting to host 192.168.240.3, port 5201
    Cookie: Lab200slot2.1397493648.413274.65e131
    [ 4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
    Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.00 sec 240 MBytes 2.02 Gbits/sec
    [ 4] 1.00-2.00 sec 239 MBytes 2.01 Gbits/sec
    [ 4] 2.00-3.00 sec 240 MBytes 2.01 Gbits/sec
    [ 4] 3.00-4.00 sec 239 MBytes 2.00 Gbits/sec
    [ 4] 4.00-5.00 sec 245 MBytes 2.05 Gbits/sec
    [ 4] 5.00-6.00 sec 240 MBytes 2.01 Gbits/sec
    [ 4] 6.00-7.00 sec 240 MBytes 2.02 Gbits/sec
    [ 4] 7.00-8.00 sec 239 MBytes 2.01 Gbits/sec

    With this revert applied, SCTP performance is back to normal on
    latest upstream for both IPv4 and IPv6 and matches the throughput of
    the 3.4.2 test kernel; throughput is steady and the interval reports
    are smooth again.

    Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
    Reported-by: Peter Butler
    Reported-by: Dongsheng Song
    Reported-by: Fengguang Wu
    Tested-by: Peter Butler
    Signed-off-by: Daniel Borkmann
    Cc: Matija Glavinic Pecotic
    Cc: Alexander Sverdlin
    Cc: Vlad Yasevich
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
    Francois reported that setting a big MTU on the loopback device could
    prevent TCP sessions from making progress.

    We do not support IPv6 Jumbograms (yet?), so we cook corrupted packets.

    In theory, we must limit the IPv6 MTU to (65535 + 40) bytes.

    Tested:

    ifconfig lo mtu 70000
    netperf -H ::1

    Before patch : Throughput : 0.05 Mbits

    After patch : Throughput : 35484 Mbits

    Reported-by: Francois WELLENREITER
    Signed-off-by: Eric Dumazet
    Acked-by: YOSHIFUJI Hideaki
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
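    The (65535 + 40) bound follows from the 16-bit IPv6 payload length
    field plus the fixed 40-byte IPv6 header. A minimal sketch of the
    clamp, assuming nothing about where the kernel applies it; the MODEL_
    names are hypothetical (65535 is the value of the kernel's IPV6_MAXPLEN):

```c
#include <assert.h>

/* Largest payload expressible without a Jumbo Payload option. */
#define MODEL_IPV6_MAXPLEN 65535u
/* Fixed IPv6 header length. */
#define MODEL_IPV6_HDR_LEN 40u
#define MODEL_IPV6_MAX_MTU (MODEL_IPV6_MAXPLEN + MODEL_IPV6_HDR_LEN)

/* Clamp a device MTU (e.g. 70000 on lo) to what non-jumbogram IPv6
 * packets can actually describe in their payload length field. */
static unsigned int ipv6_effective_mtu(unsigned int dev_mtu) {
    return dev_mtu > MODEL_IPV6_MAX_MTU ? MODEL_IPV6_MAX_MTU : dev_mtu;
}
```

    With the test setting above, an lo MTU of 70000 would be treated as
    65575 for IPv6.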
     

14 Apr, 2014

1 commit

    nft_cmp_fast is used for equality comparisons of size <= 4 bytes. For
    comparisons of size < 4 bytes, a mask is calculated that is applied to
    both the data from userspace (during initialization) and the register
    value (during runtime). Both values are stored using (in effect) memcpy
    to a memory area that is then interpreted as u32 by nft_cmp_fast.

    This works fine on little endian since smaller types have the same base
    address; however, on big endian this is not true and the smaller types
    are interpreted as a big number with trailing zero bytes.

    The mask therefore must not include the lower bytes, but the higher bytes
    on big endian. Add a helper function that does a cpu_to_le32 to switch
    the bytes on big endian. Since we're dealing with a mask of just consecutive
    bits, this works out fine.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
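    The masking idea can be checked on any host with a portable sketch.
    model_cpu_to_le32() is a hypothetical portable equivalent of the
    kernel's cpu_to_le32(), and cmp_fast_eq() is an illustrative reduction
    of the comparison, not the nf_tables code. Building the mask as
    cpu_to_le32(~0 >> (32 - bits)) gives it the same in-memory layout on
    both endians: the first len bytes are 0xff, matching the bytes that
    memcpy filled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Portable cpu_to_le32(): lay the value out little-endian in memory and
 * reinterpret those bytes as a host-order u32. */
static uint32_t model_cpu_to_le32(uint32_t v) {
    uint8_t b[4] = { (uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
                     (uint8_t)((v >> 16) & 0xff), (uint8_t)((v >> 24) & 0xff) };
    uint32_t out;
    memcpy(&out, b, sizeof(out));
    return out;
}

/* Mask covering the first len_bits / 8 bytes in memory (1..32 bits). */
static uint32_t cmp_fast_mask(unsigned int len_bits) {
    assert(len_bits >= 1 && len_bits <= 32);
    return model_cpu_to_le32(UINT32_MAX >> (32 - len_bits));
}

/* Compare the first len_bytes of a 4-byte register against user data,
 * both copied into u32-sized areas as the commit message describes.
 * Trailing register bytes may hold junk and must be masked away. */
static bool cmp_fast_eq(const unsigned char reg[4], const void *data,
                        unsigned int len_bytes) {
    uint32_t r, d = 0, mask;
    assert(len_bytes >= 1 && len_bytes <= 4);
    memcpy(&r, reg, sizeof(r));
    memcpy(&d, data, len_bytes);
    mask = cmp_fast_mask(len_bytes * 8);
    return (r & mask) == (d & mask);
}
```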
     

13 Apr, 2014

1 commit

  • Pull yet more networking updates from David Miller:

    1) Various fixes to the new Redpine Signals wireless driver, from
    Fariya Fatima.

    2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
    Dmitry Petukhov.

    3) UFO and TSO packets differ in whether they include the protocol
    header in gso_size, account for that in skb_gso_transport_seglen().
    From Florian Westphal.

    4) If VLAN untagging fails, we double free the SKB in the bridging
    output path. From Toshiaki Makita.

    5) Several call sites of sk->sk_data_ready() were referencing an SKB
    just added to the socket receive queue in order to calculate the
    second argument via skb->len. This is dangerous because the moment
    the skb is added to the receive queue it can be consumed in another
    context and freed up.

    It turns out also that none of the sk->sk_data_ready()
    implementations even care about this second argument.

    So just kill it off and thus fix all these use-after-free bugs as a
    side effect.

    6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.

    7) pktgen needs to do locking properly for LLTX devices, from Daniel
    Borkmann.

    8) xen-netfront driver initializes TX array entries in RX loop :-) From
    Vincenzo Maffione.

    9) After refactoring, some tunnel drivers allow a tunnel to be
    configured on top of itself. Fix from Nicolas Dichtel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    vti: don't allow to add the same tunnel twice
    gre: don't allow to add the same tunnel twice
    drivers: net: xen-netfront: fix array initialization bug
    pktgen: be friendly to LLTX devices
    r8152: check RTL8152_UNPLUG
    net: sun4i-emac: add promiscuous support
    net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    net: ipv6: Fix oif in TCP SYN+ACK route lookup.
    drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
    drivers: net: cpsw: discard all packets received when interface is down
    net: Fix use after free by removing length arg from sk_data_ready callbacks.
    Drivers: net: hyperv: Address UDP checksum issues
    Drivers: net: hyperv: Negotiate suitable ndis version for offload support
    Drivers: net: hyperv: Allocate memory for all possible per-pecket information
    bridge: Fix double free and memory leak around br_allowed_ingress
    bonding: Remove debug_fs files when module init fails
    i40evf: program RSS LUT correctly
    i40evf: remove open-coded skb_cow_head
    ixgb: remove open-coded skb_cow_head
    igbvf: remove open-coded skb_cow_head
    ...

    Linus Torvalds
     

12 Apr, 2014

2 commits

  • Pull 9p changes from Eric Van Hensbergen:
    "A bunch of updates and cleanup within the transport layer,
    particularly with a focus on RDMA"

    * tag 'for-linus-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9pnet_rdma: check token type before int conversion
    9pnet: trans_fd : allocate struct p9_trans_fd and struct p9_conn together.
    9pnet: p9_client->conn field is unused. Remove it.
    9P: Get rid of REQ_STATUS_FLSH
    9pnet_rdma: add cancelled()
    9pnet_rdma: update request status during send
    9P: Add cancelled() to the transport functions.
    net: Mark function as static in 9p/client.c
    9P: Add memory barriers to protect request fields over cb/rpc threads handoff

    Linus Torvalds
     
  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But the moment we place the SKB onto the socket receive queue, it
    can be consumed and freed up. So this skb->len access potentially
    touches freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no implementation of this callback actually uses the
    length argument. And since nobody cared about its value, lots of
    call sites pass in arbitrary values such as '0' and even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
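    The resulting calling convention can be sketched with a reduced model.
    struct sock_model and deliver() are hypothetical; the point they
    illustrate is that the callback takes only the socket (the kernel's
    callback becomes void (*sk_data_ready)(struct sock *sk)), so the
    producer reads nothing from the skb after handing it to the queue.

```c
#include <assert.h>

/* Hypothetical reduced socket; not the kernel's struct sock. */
struct sock_model {
    int queued;                                   /* packets on the queue */
    int wakeups;                                  /* reader notifications */
    void (*sk_data_ready)(struct sock_model *sk); /* no length argument */
};

/* A reader that needs sizes inspects its queue itself, under its own
 * locking, instead of trusting a length supplied by the producer. */
static void default_data_ready(struct sock_model *sk) {
    sk->wakeups++;
}

/* Producer: once the packet is queued it may be consumed and freed by
 * another context, so nothing derived from it is passed along. */
static void deliver(struct sock_model *sk) {
    sk->queued++;          /* stands in for skb_queue_tail() */
    sk->sk_data_ready(sk); /* no skb->len read after the hand-off */
}
```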
     

09 Apr, 2014

1 commit

  • Pull more networking updates from David Miller:

    1) If a VXLAN interface is created with no groups, we can crash on
    reception of packets. Fix from Mike Rapoport.

    2) Missing includes in CPTS driver, from Alexei Starovoitov.

    3) Fix string validations in isdnloop driver, from YOSHIFUJI Hideaki
    and Dan Carpenter.

    4) Missing irq.h include in bnx2x, enic, and qlcnic drivers. From
    Josh Boyer.

    5) AF_PACKET transmit doesn't statistically count TX drops, from Daniel
    Borkmann.

    6) Byte-Queue-Limit enabled drivers aren't handled properly in
    AF_PACKET transmit path, also from Daniel Borkmann.

    Same problem exists in pktgen, and Daniel fixed it there too.

    7) Fix resource leaks in driver probe error paths of new sxgbe driver,
    from Francois Romieu.

    8) Truesize of SKBs can gradually get more and more corrupted in NAPI
    packet recycling path, fix from Eric Dumazet.

    9) Fix uniprocessor netfilter build, from Florian Westphal. In the
    longer term we should perhaps try to find a way for ARRAY_SIZE() to
    work even with zero sized array elements.

    10) Fix crash in netfilter conntrack extensions due to mis-estimation of
    required extension space. From Andrey Vagin.

    11) Since we commit table rule updates before trying to copy the
    counters back to userspace (it's the last action we perform), we
    really can't signal the user copy with an error as we are beyond the
    point from which we can unwind everything. This causes all kinds of
    use after free crashes and other mysterious behavior.

    From Thomas Graf.

    12) Restore previous behavior of div/mod by zero in BPF filter
    processing. From Daniel Borkmann.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
    net: sctp: wake up all assocs if sndbuf policy is per socket
    isdnloop: several buffer overflows
    netdev: remove potentially harmful checks
    pktgen: fix xmit test for BQL enabled devices
    net/at91_ether: avoid NULL pointer dereference
    tipc: Let tipc_release() return 0
    at86rf230: fix MAX_CSMA_RETRIES parameter
    mac802154: fix duplicate #include headers
    sxgbe: fix duplicate #include headers
    net: filter: be more defensive on div/mod by X==0
    netfilter: Can't fail and free after table replacement
    xen-netback: Trivial format string fix
    net: bcmgenet: Remove unnecessary version.h inclusion
    net: smc911x: Remove unused local variable
    bonding: Inactive slaves should keep inactive flag's value
    netfilter: nf_tables: fix wrong format in request_module()
    netfilter: nf_tables: set names cannot be larger than 15 bytes
    netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
    netfilter: Add {ipt,ip6t}_osf aliases for xt_osf
    netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks
    ...

    Linus Torvalds
     

04 Apr, 2014

2 commits

  • "len" contains sizeof(nf_ct_ext) and size of extensions. In a worst
    case it can contain all extensions. Below you can find the sizes of all
    types of extensions. Their sum is definitely bigger than 256.

    nf_ct_ext_types[0]->len = 24
    nf_ct_ext_types[1]->len = 32
    nf_ct_ext_types[2]->len = 24
    nf_ct_ext_types[3]->len = 32
    nf_ct_ext_types[4]->len = 152
    nf_ct_ext_types[5]->len = 2
    nf_ct_ext_types[6]->len = 16
    nf_ct_ext_types[7]->len = 8

    I have seen "len" up to 280, and my host crashes without this patch.

    The right way to fix this problem is reducing the size of the ecache
    extension (4) and Florian is going to do this, but these changes will
    be too large to be appropriate for a stable tree.

    Fixes: 5b423f6a40a0 (netfilter: nf_conntrack: fix racy timer handling with reliable)
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Jozsef Kadlecsik
    Cc: "David S. Miller"
    Signed-off-by: Andrey Vagin
    Signed-off-by: Pablo Neira Ayuso

    Andrey Vagin
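    The arithmetic behind the overflow can be checked directly. This sketch
    sums only the eight per-type sizes listed above (the real "len" also
    includes sizeof(struct nf_ct_ext) and alignment, so it is larger still)
    and shows what happens when the total is stored in an 8-bit versus a
    16-bit length field. The array name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-type extension sizes from the commit message (types 0..7). */
static const unsigned int ext_len[] = { 24, 32, 24, 32, 152, 2, 16, 8 };

/* Worst case: every extension attached at once. */
static unsigned int total_ext_len(void) {
    unsigned int sum = 0;
    for (size_t i = 0; i < sizeof(ext_len) / sizeof(ext_len[0]); i++)
        sum += ext_len[i];
    return sum;
}
```

    The total is 290, which an 8-bit field silently wraps to 34, so later
    offset calculations point into memory that was never reserved; a
    16-bit field, as in "reserve two bytes for nf_ct_ext->len" in the
    09 Apr pull above, holds it intact.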
     
  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the removal of the xattr.h include from cgroup.h:
    cgroup.h was indirectly included by a lot of files, and it brought
    in xattr.h, which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

03 Apr, 2014

2 commits

  • Pull networking updates from David Miller:
    "Here is my initial pull request for the networking subsystem during
    this merge window:

    1) Support for ESN in AH (RFC 4302) from Fan Du.

    2) Add full kernel doc for ethtool command structures, from Ben
    Hutchings.

    3) Add BCM7xxx PHY driver, from Florian Fainelli.

    4) Export computed TCP rate information in netlink socket dumps, from
    Eric Dumazet.

    5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
    Dichtel.

    6) Convert many drivers to pci_enable_msix_range(), from Alexander
    Gordeev.

    7) Record SKB timestamps more efficiently, from Eric Dumazet.

    8) Switch to microsecond resolution for TCP round trip times, also
    from Eric Dumazet.

    9) Clean up and fix 6lowpan fragmentation handling by making use of
    the existing inet_frag api for its implementation.

    10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.

    11) Auto size SKB lengths when composing netlink messages based upon
    past message sizes used, from Eric Dumazet.

    12) qdisc dumps can take a long time, add a cond_resched(), From Eric
    Dumazet.

    13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
    Get rid of never-used-in-tree netpoll RX handling. From Eric W
    Biederman.

    14) Support inter-address-family and namespace changing in VTI tunnel
    driver(s). From Steffen Klassert.

    15) Add Altera TSE driver, from Vince Bridgers.

    16) Optimizing csum_replace2() so that it doesn't adjust the checksum
    by checksumming the entire header, from Eric Dumazet.

    17) Expand BPF internal implementation for faster interpreting, more
    direct translations into JIT'd code, and much cleaner uses of BPF
    filtering in non-socket contexts. From Daniel Borkmann and Alexei
    Starovoitov"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
    netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
    net: Add a test to see if a skb is freeable in irq context
    qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
    net: ptp: move PTP classifier in its own file
    net: sxgbe: make "core_ops" static
    net: sxgbe: fix logical vs bitwise operation
    net: sxgbe: sxgbe_mdio_register() frees the bus
    Call efx_set_channels() before efx->type->dimension_resources()
    xen-netback: disable rogue vif in kthread context
    net/mlx4: Set proper build dependancy with vxlan
    be2net: fix build dependency on VxLAN
    mac802154: make csma/cca parameters per-wpan
    mac802154: allow only one WPAN to be up at any given time
    net: filter: minor: fix kdoc in __sk_run_filter
    netlink: don't compare the nul-termination in nla_strcmp
    can: c_can: Avoid led toggling for every packet.
    can: c_can: Simplify TX interrupt cleanup
    can: c_can: Store dlc private
    can: c_can: Reduce register access
    can: c_can: Make the code readable
    ...

    Linus Torvalds
     
  • Pull trivial tree updates from Jiri Kosina:
    "Usual rocket science -- mostly documentation and comment updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    sparse: fix comment
    doc: fix double words
    isdn: capi: fix "CAPI_VERSION" comment
    doc: DocBook: Fix typos in xml and template file
    Bluetooth: add module name for btwilink
    driver core: unexport static function create_syslog_header
    mmc: core: typo fix in printk specifier
    ARM: spear: clean up editing mistake
    net-sysfs: fix comment typo 'CONFIG_SYFS'
    doc: Insert MODULE_ in module-signing macros
    Documentation: update URL to hfsplus Technote 1150
    gpio: update path to documentation
    ixgbe: Fix format string in ixgbe_fcoe.
    Kconfig: Remove useless "default N" lines
    user_namespace.c: Remove duplicated word in comment
    CREDITS: fix formatting
    treewide: Fix typo in Documentation/DocBook
    mm: Fix warning on make htmldocs caused by slab.c
    ata: ata-samsung_cf: cleanup in header file
    idr: remove unused prototype of idr_free()

    Linus Torvalds
     

02 Apr, 2014

1 commit

  • Commit 9b2777d6089bcd (ieee802154: add TX power control to wpan_phy)
    and following erroneously added CSMA and CCA parameters for 802.15.4
    devices as PHY parameters, while they are actually MAC parameters and
    can differ for any two WPAN instances. Since it is now sensible to have
    multiple WPAN devices with differing CSMA/CCA parameters, make these
    parameters MAC parameters instead.

    Signed-off-by: Phoebe Buckheister
    Signed-off-by: David S. Miller

    Phoebe Buckheister
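    The move from PHY to MAC parameters can be pictured as a change in
    which structure owns the fields. This sketch is illustrative only:
    struct phy_model, struct wpan_mac_model and wpan_set_csma() are
    hypothetical and do not reproduce the kernel's field names. Two WPAN
    interfaces share one PHY, yet each keeps its own CSMA backoff
    exponents and retry count (the 802.15.4 macMinBE, macMaxBE and
    macMaxCSMABackoffs attributes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PHY: only genuinely physical settings live here. */
struct phy_model {
    int8_t tx_power;
};

/* Hypothetical per-interface MAC context sharing a PHY. */
struct wpan_mac_model {
    struct phy_model *phy;  /* several interfaces may share one PHY */
    uint8_t min_be;         /* CSMA minimum backoff exponent */
    uint8_t max_be;         /* CSMA maximum backoff exponent */
    uint8_t csma_retries;   /* maximum CSMA backoffs */
};

/* Changing CSMA settings on one interface leaves its siblings alone. */
static void wpan_set_csma(struct wpan_mac_model *mac, uint8_t min_be,
                          uint8_t max_be, uint8_t retries) {
    mac->min_be = min_be;
    mac->max_be = max_be;
    mac->csma_retries = retries;
}
```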
     

01 Apr, 2014

3 commits


31 Mar, 2014

1 commit

  • This patch basically does two things, i) removes the extern keyword
    from the include/linux/filter.h file to be more consistent with the
    rest of Joe's changes, and ii) moves filter accounting into the filter
    core framework.

    Filter accounting, mainly done through sk_filter_{un,}charge(), takes
    care of the case when sockets are being cloned through sk_clone_lock()
    so that removal of the filter on one socket won't result in eviction
    as it's still referenced by the other.

    These functions actually belong to net/core/filter.c and not
    include/net/sock.h as we want to keep all that in a central place.
    It's also not in fast-path so uninlining them is fine and even allows
    us to get rid of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward
    declaration.

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Daniel Borkmann
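    The cloning case described above is plain reference counting. A minimal
    sketch, assuming hypothetical types and helpers (filter_charge() and
    filter_uncharge() only echo sk_filter_{un,}charge(); the kernel also
    adjusts socket memory accounting and frees via RCU, both omitted here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reduced filter and socket. */
struct filter_model { int refcnt; };
struct sock_model { struct filter_model *filter; };

static bool filter_freed; /* lets callers observe the final release */

/* Charge: the socket takes a reference on the filter it attaches. */
static void filter_charge(struct sock_model *sk, struct filter_model *fp) {
    fp->refcnt++;
    sk->filter = fp;
}

/* Uncharge: drop this socket's reference; free only on the last one,
 * so a clone still using the filter keeps it alive. */
static void filter_uncharge(struct sock_model *sk) {
    struct filter_model *fp = sk->filter;
    sk->filter = NULL;
    if (fp && --fp->refcnt == 0) {
        free(fp);
        filter_freed = true;
    }
}
```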
     

30 Mar, 2014

1 commit


29 Mar, 2014

1 commit

    addrconf_join_solict and addrconf_join_anycast may cause actions which
    need the rtnl lock held, especially on first address creation.

    A new DAD state is introduced which defers the initial DAD processing
    to a workqueue.

    To take the rtnl lock, we need to push the code paths which depend on
    those calls up to workqueues, specifically addrconf_verify and the DAD
    processing.

    (v2)
    addrconf_dad_failure needs to be queued up to the workqueue, too. This
    patch introduces a new DAD state and stops the DAD processing in the
    workqueue (this is because of the possible ipv6_del_addr processing
    which removes the solicited multicast address from the device).

    addrconf_verify_lock is removed, too. After the transition it is not
    needed any more.

    As we are no longer processing in bottom half, we need to be a bit more
    careful about disabling bottom halves when we take spin locks that are
    also used in bh context.

    Relevant backtrace:
    [ 541.030090] RTNL: assertion failed at net/core/dev.c (4496)
    [ 541.031143] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.10.33-1-amd64-vyatta #1
    [ 541.031145] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
    [ 541.031146] ffffffff8148a9f0 000000000000002f ffffffff813c98c1 ffff88007c4451f8
    [ 541.031148] 0000000000000000 0000000000000000 ffffffff813d3540 ffff88007fc03d18
    [ 541.031150] 0000880000000006 ffff88007c445000 ffffffffa0194160 0000000000000000
    [ 541.031152] Call Trace:
    [ 541.031153] [] ? dump_stack+0xd/0x17
    [ 541.031180] [] ? __dev_set_promiscuity+0x101/0x180
    [ 541.031183] [] ? __hw_addr_create_ex+0x60/0xc0
    [ 541.031185] [] ? __dev_set_rx_mode+0xaa/0xc0
    [ 541.031189] [] ? __dev_mc_add+0x61/0x90
    [ 541.031198] [] ? igmp6_group_added+0xfc/0x1a0 [ipv6]
    [ 541.031208] [] ? kmem_cache_alloc+0xcb/0xd0
    [ 541.031212] [] ? ipv6_dev_mc_inc+0x267/0x300 [ipv6]
    [ 541.031216] [] ? addrconf_join_solict+0x2e/0x40 [ipv6]
    [ 541.031219] [] ? ipv6_dev_ac_inc+0x159/0x1f0 [ipv6]
    [ 541.031223] [] ? addrconf_join_anycast+0x92/0xa0 [ipv6]
    [ 541.031226] [] ? __ipv6_ifa_notify+0x11e/0x1e0 [ipv6]
    [ 541.031229] [] ? ipv6_ifa_notify+0x33/0x50 [ipv6]
    [ 541.031233] [] ? addrconf_dad_completed+0x28/0x100 [ipv6]
    [ 541.031241] [] ? task_cputime+0x2d/0x50
    [ 541.031244] [] ? addrconf_dad_timer+0x136/0x150 [ipv6]
    [ 541.031247] [] ? addrconf_dad_completed+0x100/0x100 [ipv6]
    [ 541.031255] [] ? call_timer_fn.isra.22+0x2a/0x90
    [ 541.031258] [] ? addrconf_dad_completed+0x100/0x100 [ipv6]

    Hunks and backtrace stolen from a patch by Stephen Hemminger.

    Reported-by: Stephen Hemminger
    Signed-off-by: Stephen Hemminger
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

28 Mar, 2014

2 commits

  • If an IPv6 host route with metrics exists, an attempt to add a
    new route for the same target with different metrics fails but
    rewrites the metrics anyway:

    12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000
    12sp0:~ # ip -6 route show
    fe80::/64 dev eth0 proto kernel metric 256
    fec0::1 dev eth0 metric 1024 rto_min lock 1s
    12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1500
    RTNETLINK answers: File exists
    12sp0:~ # ip -6 route show
    fe80::/64 dev eth0 proto kernel metric 256
    fec0::1 dev eth0 metric 1024 rto_min lock 1.5s

    This is caused by all IPv6 host routes using the metrics in
    their inetpeer (or the shared default). This also holds for the
    new route created in ip6_route_add() which shares the metrics
    with the already existing route and thus ip6_route_add()
    rewrites the metrics even if the new route ends up not being
    used at all.

    Another problem is that old metrics in inetpeer can reappear
    unexpectedly for a new route, e.g.

    12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000
    12sp0:~ # ip route del fec0::1
    12sp0:~ # ip route add fec0::1 dev eth0
    12sp0:~ # ip route change fec0::1 dev eth0 hoplimit 10
    12sp0:~ # ip -6 route show
    fe80::/64 dev eth0 proto kernel metric 256
    fec0::1 dev eth0 metric 1024 hoplimit 10 rto_min lock 1s

    Resolve the first problem by moving the setting of metrics down
    into fib6_add_rt2node(), to the point where we are sure we are
    inserting the new route into the tree. The second problem is
    addressed by introducing a new flag, DST_METRICS_FORCE_OVERWRITE,
    which is set for a new host route in ip6_route_add() and makes
    ipv6_cow_metrics() always overwrite the metrics in inetpeer
    (even if they are not "new"); it is reset after that.
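    The fix described above (a one-shot force-overwrite flag kept in the
    low bits of the metrics word, per the v5 note) can be modelled in
    user space as a minimal sketch. Everything here is hypothetical:
    struct route, struct peer, the METRICS_* bits and route_cow_metrics()
    only mirror the described behaviour and are not the kernel's types.
    The shared per-peer array is overwritten only if it is unused or the
    force flag is set, and the flag is dropped once applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NMETRICS 2                      /* hypothetical: [0] rto_min, [1] hoplimit */

/* Hypothetical flag bits stored in the low bits of an aligned pointer. */
#define METRICS_READ_ONLY       0x1UL
#define METRICS_FORCE_OVERWRITE 0x2UL
#define METRICS_FLAGS           0x3UL   /* mask of all reserved low bits */

struct peer {                           /* stand-in for shared inetpeer metrics */
    bool metrics_new;                   /* true until first initialised */
    uint32_t metrics[NMETRICS];
};

struct route {
    uintptr_t metrics;                  /* pointer to u32 array | flag bits */
};

static uint32_t *route_metrics_ptr(const struct route *rt)
{
    return (uint32_t *)(rt->metrics & ~METRICS_FLAGS);
}

static void route_set_metrics(struct route *rt, uint32_t *m, uintptr_t flags)
{
    /* u32 arrays are at least 4-byte aligned, so the two low bits are free */
    assert(((uintptr_t)m & METRICS_FLAGS) == 0);
    rt->metrics = (uintptr_t)m | flags;
}

/* Switch the route to the peer's shared array. The route's own values are
 * copied over it only if the peer array is unused or the force flag is set;
 * storing the bare pointer afterwards also clears the one-shot flag. */
static void route_cow_metrics(struct route *rt, struct peer *p)
{
    uintptr_t old = rt->metrics;

    if (p->metrics_new || (old & METRICS_FORCE_OVERWRITE))
        memcpy(p->metrics, route_metrics_ptr(rt), sizeof p->metrics);
    p->metrics_new = false;
    rt->metrics = (uintptr_t)p->metrics;
}
```

    Without the flag, a new route inherits the stale rto_min left in the
    peer by a deleted route; with it, the new route's own values win.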

    v5: use a flag in _metrics member rather than one in flags

    v4: fix a typo making a condition always true (thanks to Hannes
    Frederic Sowa)

    v3: rewritten based on David Miller's idea to move setting the
    metrics (and allocation in non-host case) down to the point we
    already know the route is to be inserted. Also rebased to
    net-next as it is quite late in the cycle.

    Signed-off-by: Michal Kubecek
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • …etooth/bluetooth-next

    John W. Linville
     

27 Mar, 2014

1 commit

  • The packet hash can be considered a property of the packet, not just
    of the RX path.

    This patch renames the rxhash and l4_rxhash skbuff fields to hash and
    l4_hash, respectively. This includes changing uses of the fields in
    code that does not call the access functions.

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Cc: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Tom Herbert
     

26 Mar, 2014

6 commits

  • Conflicts:
    Documentation/devicetree/bindings/net/micrel-ks8851.txt
    net/core/netpoll.c

    The net/core/netpoll.c conflict is a bug fix in 'net' happening
    to code which is completely removed in 'net-next'.

    In micrel-ks8851.txt we simply have overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • John W. Linville says:

    ====================
    Please pull this batch of wireless updates intended for 3.15!

    For the mac80211 bits, Johannes says:

    "This has a whole bunch of bugfixes for things that went into -next
    previously as well as some other bugfixes I didn't want to rush into
    3.14 at this point. The rest of it is some cleanups and a few small
    features, the biggest of which is probably Janusz's regulatory DFS CAC
    time code."

    For the Bluetooth bits, Gustavo says:

    "One more pull request to 3.15. This is mostly a bug fix pull request;
    it contains several fixes and cleanups all over the tree, plus some
    small new features."

    For the NFC bits, Samuel says:

    "This is the NFC pull request for 3.15. With this one we have:

    - Support for ISO 15693 a.k.a. NFC vicinity a.k.a. Type 5 tags. ISO
    15693 are long range (1 - 2 meters) vicinity tags/cards. The kernel
    now supports those through the NFC netlink and digital APIs.

    - Support for TI's trf7970a chipset. This chipset relies on the NFC
    digital layer and the driver currently supports type 2, 4A and 5 tags.

    - Support for NXP's pn544 secure firmware download. The pn544 C3 chipsets
    rely on a different firmware download protocol than the C2 one. We
    now support both and use the right one depending on the version we
    detect at runtime.

    - Support for 4A tags from the NFC digital layer.

    - A bunch of cleanups and minor fixes from Axel Lin and Thierry Escande."

    For the iwlwifi bits, Emmanuel says:

    "We were sending a host command while the mutex wasn't held. This
    led to hard-to-catch races."

    And...

    "I have a fix for a "merge damage" which is not really a merge
    damage: it enables scheduled scan which has been disabled in
    wireless.git. Since you merged wireless.git into wireless-next.git,
    this can now be fixed in wireless-next.git.

    Besides this, Alex made a workaround for a hardware bug. This fix
    allows us to consume less power in S3. Arik and Eliad continue to
    work on D0i3 which is a run-time power saving feature. Eliad also
    contributes a few bits to the rate scaling logic to which Eyal adds his
    own contribution. Avri dives deep into the power code - newer firmware
    will allow us to enable power save in new scenarios. Johannes made a few
    clean-ups. I have the regular amount of BT Coex boring stuff. I disable
    uAPSD since we identified firmware bugs that cause packet loss. One
    thing that does stand out is the udev event that we now send when the
    FW asserts. I hope it will allow us to debug the FW more easily."

    Also included is one last iwlwifi pull for a build breakage fix...

    For the Atheros bits, Kalle says:

    "Michal now did some optimisations and was able to improve throughput by
    100 Mbps on our MIPS based AP135 platform. Chun-Yeow added some
    workarounds to be able to better use ad-hoc mode. Ben improved log
    messages and added support for MSDU chaining. And, as usual, also some
    smaller fixes."

    Beyond that...

    Andrea Merello continues his rtl8180 refactoring, in preparation for
    a long-awaited rtl8187 driver. We get a new driver (rsi) for the
    RS9113 chip, from Fariya Fatima. And, of course, we get the usual
    round of updates for ath9k, brcmfmac, mwifiex, wil6210, etc. as well.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Simon Derr
    Signed-off-by: Eric Van Hensbergen

    Simon Derr
     
  • This request state is mostly useless, and properly implementing it
    for RDMA would require an extra lock to be taken in handle_recv()
    and in rdma_cancel() to avoid this race:

    handle_recv()                rdma_cancel()
          .                            .
          .                      if req->state == SENT
    req->state = RCVD                  .
          .                      req->state = FLSH

    So just get rid of it.

    Signed-off-by: Simon Derr
    Signed-off-by: Eric Van Hensbergen

    Simon Derr
     
  • And move transport-specific code out of net/9p/client.c

    Signed-off-by: Simon Derr
    Signed-off-by: Eric Van Hensbergen

    Simon Derr
     
  • We need barriers to guarantee this pattern works as intended:
    [w] req->rc, 1               [r] req->status, 1
    wmb                          rmb
    [w] req->status, 1           [r] req->rc

    Where the wmb ensures that rc gets written before status,
    and the rmb ensures that if you observe status == 1, rc is the new value.
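    The same publish/consume pattern can be sketched in portable user
    space with C11 fences, assuming pthreads are available. This is an
    analogy, not the 9p code: struct req, writer(), reader() and
    run_once() are hypothetical, and a release/acquire fence pair is
    somewhat stronger than a bare wmb/rmb, but it plays the same role of
    ordering rc before status on the writer side and status before rc on
    the reader side.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct req {                 /* hypothetical stand-in for the 9p request */
    int rc;                  /* plain data, published before status */
    atomic_int status;       /* set to 1 once rc is valid */
};

static void *writer(void *arg)
{
    struct req *r = arg;

    r->rc = 42;                                          /* [w] req->rc */
    atomic_thread_fence(memory_order_release);           /* role of wmb */
    atomic_store_explicit(&r->status, 1, memory_order_relaxed); /* [w] status */
    return NULL;
}

static int reader(struct req *r)
{
    /* [r] req->status: spin until the writer has published */
    while (atomic_load_explicit(&r->status, memory_order_relaxed) != 1)
        ;
    atomic_thread_fence(memory_order_acquire);           /* role of rmb */
    return r->rc;                                        /* [r] req->rc */
}

/* Run one writer/reader round and return the rc value the reader saw. */
static int run_once(void)
{
    struct req r;
    pthread_t t;
    int seen;

    r.rc = 0;
    atomic_init(&r.status, 0);
    if (pthread_create(&t, NULL, writer, &r) != 0)
        return -1;
    seen = reader(&r);
    pthread_join(t, NULL);
    return seen;
}
```

    Because the reader's acquire fence follows a load that observed the
    writer's store, which itself follows the release fence, the write of
    rc happens-before the read of rc, so seeing status == 1 guarantees the
    new rc.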

    Signed-off-by: Dominique Martinet
    Signed-off-by: Eric Van Hensbergen

    Dominique Martinet
     

25 Mar, 2014

1 commit


24 Mar, 2014

1 commit

  • When replacing one 16-bit value with another in the IP header, we can
    adjust the IP checksum with the simple operation described in RFC 1624,
    as reminded by David.

    csum_partial() is a complex function on x86_64, not really suited for
    a small number of checksummed bytes.

    I spotted csum_partial() among the top 20 most CPU-consuming functions
    (more than 1 %) in a GRO workload, which was rather unexpected.

    The caller was inet_gro_complete() doing a csum_replace2() when
    building the new IP header for the GRO packet.
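    The RFC 1624 update (eqn. 3, HC' = ~(~HC + ~m + m')) can be sketched in
    user space and checked against a full recomputation. ip_checksum() and
    csum_update16() are illustrative helpers, not the kernel's
    csum_replace2(); the header words are taken in host order, which is
    fine because the one's-complement sum does not depend on byte order as
    long as it is used consistently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full recomputation over n 16-bit words (checksum field must be zero).
 * A 32-bit accumulator is ample for a 20-60 byte IP header. */
static uint16_t ip_checksum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    assert(words != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold(sum);
}

/* RFC 1624 eqn. 3: adjust checksum hc when one word changes from
 * old_val to new_val, touching only three 16-bit quantities. */
static uint16_t csum_update16(uint16_t hc, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~hc;

    sum += (uint16_t)~old_val;
    sum += new_val;
    return (uint16_t)~fold(sum);
}
```

    For example, rewriting tot_len (word 1) of a header with
    csum_update16(hc, old_len, new_len) yields the same value as zeroing
    the checksum field and calling ip_checksum() on the whole header.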

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Mar, 2014

2 commits


21 Mar, 2014

1 commit

  • While it is true that getnstimeofday() uses about 40 cycles if TSC
    is available, it can use 1600 cycles if hpet is the clocksource.

    Switch to get_jiffies_64(), as its resolution is more than enough, and
    go back to 60-second periods.
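    A coarse cookie clock derived from a tick counter can be sketched as
    follows. This is an assumption-laden illustration, not the kernel code:
    the tick count is passed in rather than read via get_jiffies_64(), HZ
    is assumed to be 1000, and a cookie is assumed to be accepted in the
    period it was minted and the one after (MAX_COOKIE_AGE = 2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HZ             1000u        /* assumed tick rate */
#define COOKIE_PERIOD  (60u * HZ)   /* one 60-second period, in ticks */
#define MAX_COOKIE_AGE 2u           /* assumed: current + previous period */

/* Number of whole 60 s periods elapsed, computed from a 64-bit tick
 * counter with one division instead of reading a nanosecond clock. */
static uint32_t cookie_time(uint64_t ticks)
{
    return (uint32_t)(ticks / COOKIE_PERIOD);
}

/* A cookie minted in period `minted` is still valid in period `now` if
 * fewer than MAX_COOKIE_AGE periods separate them (unsigned arithmetic
 * makes a cookie from the "future" look very old and fail). */
static bool cookie_fresh(uint32_t minted, uint32_t now)
{
    return (uint32_t)(now - minted) < MAX_COOKIE_AGE;
}
```

    With these assumptions a cookie lives between 60 and 120 seconds,
    depending on where in its period it was minted.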

    Fixes: 8c27bd75f04f ("tcp: syncookies: reduce cookie lifetime to 128 seconds")
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Mar, 2014

3 commits