19 Oct, 2014

1 commit

  • Pull networking fixes from David Miller:

    1) Include fixes for netrom and dsa (Fabian Frederick and Florian
    Fainelli)

    2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.

    3) Several SKB use after free fixes (vxlan, openvswitch, ip_tunnel,
    fou), from Li RongQing.

    4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.

    5) Use after free in virtio_net, from Michael S Tsirkin.

    6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
    Shelar.

    7) ISDN gigaset and capi bug fixes from Tilman Schmidt.

    8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.

    9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.

    10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
    Wang and Eric Dumazet.

    11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
    chunks, from Daniel Borkmann.

    12) Don't call sock_kfree_s() with NULL pointers; this function also has
    the side effect of adjusting the socket memory usage. From Cong Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
    bna: fix skb->truesize underestimation
    net: dsa: add includes for ethtool and phy_fixed definitions
    openvswitch: Set flow-key members.
    netrom: use linux/uaccess.h
    dsa: Fix conversion from host device to mii bus
    tipc: fix bug in bundled buffer reception
    ipv6: introduce tcp_v6_iif()
    sfc: add support for skb->xmit_more
    r8152: return -EBUSY for runtime suspend
    ipv4: fix a potential use after free in fou.c
    ipv4: fix a potential use after free in ip_tunnel_core.c
    hyperv: Add handling of IP header with option field in netvsc_set_hash()
    openvswitch: Create right mask with disabled megaflows
    vxlan: fix a free after use
    openvswitch: fix a use after free
    ipv4: dst_entry leak in ip_send_unicast_reply()
    ipv4: clean up cookie_v4_check()
    ipv4: share tcp_v4_save_options() with cookie_v4_check()
    ipv4: call __ip_options_echo() in cookie_v4_check()
    atm: simplify lanai.c by using module_pci_driver
    ...

    Linus Torvalds
     

15 Oct, 2014

2 commits

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     
  • It is okay to free a NULL pointer but not okay to mischarge the socket optmem
    accounting. Compile test only.

    Reported-by: rucsoftsec@gmail.com
    Cc: Chien Yen
    Cc: Stephen Hemminger
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
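
    The accounting side effect described above can be sketched in user space.
    This is a minimal model, not kernel code: struct mock_sock, its omem_alloc
    counter and the mock_* helpers are hypothetical stand-ins for the socket
    optmem accounting, and the caller-side guard is an assumption about the
    shape of the fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the optmem counter kept per socket. */
struct mock_sock {
    long omem_alloc;   /* bytes currently charged to the socket */
};

/* Allocate and charge the socket only when the allocation succeeds. */
static void *mock_sock_kmalloc(struct mock_sock *sk, size_t size)
{
    void *mem = malloc(size);

    if (mem)
        sk->omem_alloc += (long)size;
    return mem;
}

/* Free and uncharge unconditionally: the accounting runs even for NULL. */
static void mock_sock_kfree_s(struct mock_sock *sk, void *mem, size_t size)
{
    free(mem);                      /* free(NULL) is a harmless no-op */
    sk->omem_alloc -= (long)size;   /* but the counter still changes */
}

/* Caller-side guard (assumed shape of the fix): skip the call for NULL. */
static void mock_release_opt(struct mock_sock *sk, void *mem, size_t size)
{
    if (mem)
        mock_sock_kfree_s(sk, mem, size);
}
```

    Calling mock_sock_kfree_s() with NULL after a failed allocation would
    subtract bytes that were never charged, which is the mischarge the commit
    message warns about.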
     

09 Oct, 2014

1 commit

  • Pull networking updates from David Miller:
    "Most notable changes in here:

    1) By far the biggest accomplishment, thanks to a large range of
    contributors, is the addition of multi-send for transmit. This is
    the result of discussions back in Chicago, and the hard work of
    several individuals.

    Now, when the ->ndo_start_xmit() method of a driver sees
    skb->xmit_more as true, it can choose to defer the doorbell
    telling the device to start processing the new TX queue entries.

    skb->xmit_more means that the generic networking is guaranteed to
    call the driver immediately with another SKB to send.

    There is logic added to the qdisc layer to dequeue multiple
    packets at a time, and the handling of mis-predicted offloads in
    software is now done with no locks held.

    Finally, pktgen is extended to have a "burst" parameter that can
    be used to test a multi-send implementation.

    Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
    virtio_net

    Adding support is almost trivial, so expect more drivers to
    support this optimization soon.

    I want to thank, in no particular or implied order, Jesper
    Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
    Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
    David Tat, Hannes Frederic Sowa, and Rusty Russell.

    2) PTP and timestamping support in bnx2x, from Michal Kalderon.

    3) Allow adjusting the rx_copybreak threshold for a driver via
    ethtool, and add rx_copybreak support to enic driver. From
    Govindarajulu Varadarajan.

    4) Significant enhancements to the generic PHY layer and the bcm7xxx
    driver in particular (EEE support, auto power down, etc.) from
    Florian Fainelli.

    5) Allow raw buffers to be used for flow dissection, allowing drivers
    to determine the optimal "linear pull" size for devices that DMA
    into pools of pages. The objective is to get exactly the
    necessary amount of headers into the linear SKB area pre-pulled,
    but no more. The new interface drivers use is eth_get_headlen().
    From WANG Cong, with driver conversions (several had their own
    by-hand duplicated implementations) by Alexander Duyck and Eric
    Dumazet.

    6) Support checksumming more smoothly and efficiently for
    encapsulations, and add "foo over UDP" facility. From Tom
    Herbert.

    7) Add Broadcom SF2 switch driver to DSA layer, from Florian
    Fainelli.

    8) eBPF now can load programs via a system call and has an extensive
    testsuite. Alexei Starovoitov and Daniel Borkmann.

    9) Major overhaul of the packet scheduler to use RCU in several major
    areas such as the classifiers and rate estimators. From John
    Fastabend.

    10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
    Duyck.

    11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
    Dumazet.

    12) Add Datacenter TCP congestion control algorithm support, from
    Florian Westphal.

    13) Reorganize sk_buff so that __copy_skb_header() is significantly
    faster. From Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
    netlabel: directly return netlbl_unlabel_genl_init()
    net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
    net: description of dma_cookie cause make xmldocs warning
    cxgb4: clean up a type issue
    cxgb4: potential shift wrapping bug
    i40e: skb->xmit_more support
    net: fs_enet: Add NAPI TX
    net: fs_enet: Remove non NAPI RX
    r8169:add support for RTL8168EP
    net_sched: copy exts->type in tcf_exts_change()
    wimax: convert printk to pr_foo()
    af_unix: remove 0 assignment on static
    ipv6: Do not warn for informational ICMP messages, regardless of type.
    Update Intel Ethernet Driver maintainers list
    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    tipc: fix bug in multicast congestion handling
    net: better IFF_XMIT_DST_RELEASE support
    net/mlx4_en: remove NETDEV_TX_BUSY
    3c59x: fix bad split of cpu_to_le32(pci_map_single())
    net: bcmgenet: fix Tx ring priority programming
    ...

    Linus Torvalds
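
    The xmit_more behaviour described in item 1 can be illustrated with a
    small user-space model. Everything here is a hypothetical stand-in
    (struct mock_skb, struct mock_ring, the mock_* functions); it only shows
    the idea of deferring the doorbell while more packets are promised, and a
    real driver would need further conditions not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the kernel's sk_buff or ring types. */
struct mock_skb {
    bool xmit_more;          /* the stack promises another packet right away */
};

struct mock_ring {
    unsigned int queued;     /* descriptors posted since the last doorbell */
    unsigned int doorbells;  /* number of (expensive) device notifications */
};

/* Post one descriptor; ring the doorbell only when no more packets follow. */
static void mock_start_xmit(struct mock_ring *ring, const struct mock_skb *skb)
{
    ring->queued++;
    if (!skb->xmit_more) {
        ring->doorbells++;   /* one notification covers the whole batch */
        ring->queued = 0;
    }
}

/* Send a burst of n packets, marking all but the last with xmit_more. */
static void mock_send_burst(struct mock_ring *ring, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        struct mock_skb skb = { .xmit_more = (i + 1 < n) };

        mock_start_xmit(ring, &skb);
    }
}
```

    In this model a burst of eight packets costs a single doorbell instead of
    eight, which is the saving the optimization is after.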
     

04 Oct, 2014

3 commits

  • I got a report of a double free happening at RDS slab cache. One
    suspicion was that may be somewhere we were doing a sock_hold/sock_put
    on an already freed sock. Thus after providing a kernel with the
    following change:

    static inline void sock_hold(struct sock *sk)
    {
    - atomic_inc(&sk->sk_refcnt);
    + if (!atomic_inc_not_zero(&sk->sk_refcnt))
    + WARN(1, "Trying to hold sock already gone: %p (family: %hd)\n",
    + sk, sk->sk_family);
    }

    The warning successfully triggered:

    Trying to hold sock already gone: ffff81f6dda61280 (family: 21)
    WARNING: at include/net/sock.h:350 sock_hold()
    Call Trace:
    [] :rds:rds_send_remove_from_sock+0xf0/0x21b
    [] :rds:rds_send_drop_acked+0xbf/0xcf
    [] :rds_rdma:rds_ib_recv_tasklet_fn+0x256/0x2dc
    [] tasklet_action+0x8f/0x12b
    [] __do_softirq+0x89/0x133
    [] call_softirq+0x1c/0x28
    [] do_softirq+0x2c/0x7d
    [] do_IRQ+0xee/0xf7
    [] ret_from_intr+0x0/0xa

    Looking at the call chain above, the only way I think this would be
    possible is if somewhere we already released the same socket->sock which
    is assigned to the rds_message at rds_send_remove_from_sock. Which seems
    only possible to happen after the tear down done on rds_release.

    rds_release properly calls rds_send_drop_to to drop the socket from any
    rds_message, and some proper synchronization is in place to avoid race
    with rds_send_drop_acked/rds_send_remove_from_sock. However, I still see
    a very narrow window where it may be possible we touch a sock already
    released: when rds_release races with rds_send_drop_acked, we check
    RDS_MSG_ON_CONN to avoid cleanup on the same rds_message, but in this
    specific case we don't clear rm->m_rs. In this case, it seems we could
    then go on at rds_send_drop_to and after it returns, the sock is freed
    by the last sock_put on rds_release, while we are concurrently in
    rds_send_remove_from_sock; then at some point in the loop at
    rds_send_remove_from_sock we process an rds_message which didn't have
    rm->m_rs unset for a freed sock, and a possible sock_hold on a sock
    already gone at rds_release happens.

    This hopefully addresses the condition described above and avoids a double
    free on "second last" sock_put. In addition, I removed the comment about
    socket destruction on top of rds_send_drop_acked: we call rds_send_drop_to
    in rds_release and we should have things properly serialized there, thus
    I can't see the comment being accurate there.

    Signed-off-by: Herton R. Krzesinski
    Signed-off-by: David S. Miller

    Herton R. Krzesinski
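
    The debugging change above relies on an "increment unless zero"
    primitive. A minimal user-space sketch with C11 atomics follows;
    mock_inc_not_zero(), struct mock_sock and mock_sock_hold() are
    hypothetical names and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v unless it is zero; returns true if the increment happened. */
static bool mock_inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Hypothetical socket with only a reference count. */
struct mock_sock {
    atomic_int refcnt;
};

/* Take a reference only if the object is still alive (refcnt > 0). */
static bool mock_sock_hold(struct mock_sock *sk)
{
    return mock_inc_not_zero(&sk->refcnt);
}
```

    A false return corresponds to the case the WARN() in the diff reports:
    the count had already dropped to zero, so the object may be freed.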
     
  • I see two problems if we consider the sock->ops->connect attempt to fail in
    rds_tcp_conn_connect. The first issue is that for example we don't remove the
    previously added rds_tcp_connection item to rds_tcp_tc_list at
    rds_tcp_set_callbacks, which means that on a next reconnect attempt for the
    same rds_connection, when rds_tcp_conn_connect is called we can again call
    rds_tcp_set_callbacks, resulting in duplicated items on rds_tcp_tc_list,
    leading to list corruption: to avoid this just make sure we properly
    call rds_tcp_restore_callbacks before we exit. The second issue
    is that we should also release the sock properly, by setting sock = NULL
    only if we are returning without error.

    Signed-off-by: Herton R. Krzesinski
    Signed-off-by: David S. Miller

    Herton R. Krzesinski
     
  • Signed-off-by: Herton R. Krzesinski
    Signed-off-by: David S. Miller

    Herton R. Krzesinski
     

28 Aug, 2014

1 commit


27 Aug, 2014

1 commit


13 Jun, 2014

1 commit

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

31 May, 2014

2 commits

  • This patch replaces a comma between expression statements by a semicolon.

    A simplified version of the semantic patch that performs this
    transformation is as follows:

    //
    @r@
    expression e1,e2,e;
    type T;
    identifier i;
    @@

    e1
    -,
    +;
    e2;
    //

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: David S. Miller

    Himangi Saraogi
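
    A small illustration of why the comma and the semicolon are usually
    interchangeable between expression statements, and where they are not.
    The function names are made up for this sketch; the source does not say
    how the semantic patch treats the unbraced-if case shown at the end.

```c
#include <assert.h>

/* Comma operator: both assignments form a single expression statement. */
static int sum_with_comma(void)
{
    int a = 0;
    int b = 0;

    a = 1,
    b = 2;
    return a + b;
}

/* Semicolon: two separate statements with the same effect here. */
static int sum_with_semicolon(void)
{
    int a = 0;
    int b = 0;

    a = 1;
    b = 2;
    return a + b;
}

/* Under an unbraced if, the comma keeps both assignments conditional. */
static int guarded_comma(int cond)
{
    int a = 0;
    int b = 0;

    if (cond)
        a = 1, b = 2;
    return a + b;
}

/* With a semicolon, only the first assignment stays under the if. */
static int guarded_semicolon(int cond)
{
    int a = 0;
    int b = 0;

    if (cond)
        a = 1;
    b = 2;
    return a + b;
}
```

    The first two functions behave identically; the last two differ when
    cond is zero, so context matters when applying such a transformation.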
     
  • This patch replaces a comma between expression statements by a semicolon.

    A simplified version of the semantic patch that performs this
    transformation is as follows:

    //
    @r@
    expression e1,e2,e;
    type T;
    identifier i;
    @@

    e1
    -,
    +;
    e2;
    //

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: David S. Miller

    Himangi Saraogi
     

19 May, 2014

1 commit


10 May, 2014

1 commit


18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access potentially
    touches freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback actually uses
    the length argument. And since nobody actually cared about its
    value, lots of call sites pass arbitrary values in such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
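
    The ownership rule behind this change can be sketched as follows. The
    types and functions (struct mock_skb, struct mock_sock, mock_deliver())
    are hypothetical and only model the ordering: anything needed from the
    buffer is read before it is queued, and the callback takes no length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mock_skb {
    size_t len;
};

struct mock_sock {
    struct mock_skb *queued;                   /* one-slot receive "queue" */
    void (*data_ready)(struct mock_sock *sk);  /* no length argument */
    unsigned int wakeups;
};

/* After this call the consumer owns skb and may free it at any time. */
static void mock_queue_tail(struct mock_sock *sk, struct mock_skb *skb)
{
    sk->queued = skb;
}

/* Notification callback: it only signals that data is available. */
static void mock_data_ready(struct mock_sock *sk)
{
    sk->wakeups++;
}

/* Deliver a packet; returns the length sampled before handing it over. */
static size_t mock_deliver(struct mock_sock *sk, struct mock_skb *skb)
{
    size_t len = skb->len;   /* read while we still own skb */

    mock_queue_tail(sk, skb);
    /* skb must not be dereferenced from here on */
    sk->data_ready(sk);
    return len;
}
```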
     

01 Apr, 2014

1 commit


19 Jan, 2014

1 commit

  • This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
    handler msg_name and msg_namelen logic").

    DECLARE_SOCKADDR validates that the structure we use for writing the
    name information to is not larger than the buffer which is reserved
    for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
    consistently in sendmsg code paths.

    Signed-off-by: Steffen Hurrle
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Steffen Hurrle
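
    The kind of compile-time size check described above can be sketched with
    a C11 static assertion. MOCK_DECLARE_SOCKADDR, struct mock_sockaddr and
    mock_fill_name() are hypothetical; this illustrates the idea and is not
    the definition of the kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_NAME_BUF_SIZE 128   /* the 128-byte msg_name buffer from the text */

/* Hypothetical address type standing in for a protocol's sockaddr struct. */
struct mock_sockaddr {
    unsigned short family;
    unsigned short port;
    unsigned char  addr[16];
};

/*
 * Declare a typed pointer into the name buffer and refuse to compile if the
 * pointed-to type could overflow that buffer.
 */
#define MOCK_DECLARE_SOCKADDR(type, dst, src)                      \
    _Static_assert(sizeof(*(type)0) <= MOCK_NAME_BUF_SIZE,         \
                   "address type larger than the name buffer");    \
    type dst = (type)(src)

/* Fill in a name through the checked pointer; returns the family written. */
static unsigned short mock_fill_name(void *name_buf)
{
    MOCK_DECLARE_SOCKADDR(struct mock_sockaddr *, sa, name_buf);

    sa->family = 21;   /* sample value for the sketch */
    sa->port = 0;
    memset(sa->addr, 0, sizeof(sa->addr));
    return sa->family;
}
```

    If struct mock_sockaddr grew beyond 128 bytes, the build would fail
    instead of the write silently running past the buffer.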
     

18 Jan, 2014

2 commits

  • Conflicts:
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
    net/ipv4/tcp_metrics.c

    Overlapping changes between the "don't create two tcp metrics objects
    with the same key" race fix in net and the addition of the destination
    address in the lookup key in net-next.

    Minor overlapping changes in bnx2x driver.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • commit ae4b46e9d "net: rds: use this_cpu_* per-cpu helper" broke per-cpu
    handling for rds. chpfirst is the result of __this_cpu_read(), so it is
    an absolute pointer and not __percpu. Therefore, __this_cpu_write()
    should not operate on chpfirst, but rather on cache->percpu->first, just
    like __this_cpu_read() did before.

    Cc: # 3.8+
    Signed-off-by: Gerald Schaefer

    Signed-off-by: David S. Miller

    Gerald Schaefer
     

15 Jan, 2014

1 commit


28 Dec, 2013

1 commit

  • Binding might result in a NULL device, which is dereferenced
    causing this BUG:

    [ 1317.260548] BUG: unable to handle kernel NULL pointer dereference at 0000000000000974
    [ 1317.261847] IP: [] rds_ib_laddr_check+0x82/0x110
    [ 1317.263315] PGD 418bcb067 PUD 3ceb21067 PMD 0
    [ 1317.263502] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 1317.264179] Dumping ftrace buffer:
    [ 1317.264774] (ftrace buffer empty)
    [ 1317.265220] Modules linked in:
    [ 1317.265824] CPU: 4 PID: 836 Comm: trinity-child46 Tainted: G W 3.13.0-rc4-next-20131218-sasha-00013-g2cebb9b-dirty #4159
    [ 1317.267415] task: ffff8803ddf33000 ti: ffff8803cd31a000 task.ti: ffff8803cd31a000
    [ 1317.268399] RIP: 0010:[] [] rds_ib_laddr_check+0x82/0x110
    [ 1317.269670] RSP: 0000:ffff8803cd31bdf8 EFLAGS: 00010246
    [ 1317.270230] RAX: 0000000000000000 RBX: ffff88020b0dd388 RCX: 0000000000000000
    [ 1317.270230] RDX: ffffffff8439822e RSI: 00000000000c000a RDI: 0000000000000286
    [ 1317.270230] RBP: ffff8803cd31be38 R08: 0000000000000000 R09: 0000000000000000
    [ 1317.270230] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
    [ 1317.270230] R13: 0000000054086700 R14: 0000000000a25de0 R15: 0000000000000031
    [ 1317.270230] FS: 00007ff40251d700(0000) GS:ffff88022e200000(0000) knlGS:0000000000000000
    [ 1317.270230] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 1317.270230] CR2: 0000000000000974 CR3: 00000003cd478000 CR4: 00000000000006e0
    [ 1317.270230] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1317.270230] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
    [ 1317.270230] Stack:
    [ 1317.270230] 0000000054086700 5408670000a25de0 5408670000000002 0000000000000000
    [ 1317.270230] ffffffff84223542 00000000ea54c767 0000000000000000 ffffffff86d26160
    [ 1317.270230] ffff8803cd31be68 ffffffff84223556 ffff8803cd31beb8 ffff8800c6765280
    [ 1317.270230] Call Trace:
    [ 1317.270230] [] ? rds_trans_get_preferred+0x42/0xa0
    [ 1317.270230] [] rds_trans_get_preferred+0x56/0xa0
    [ 1317.270230] [] rds_bind+0x73/0xf0
    [ 1317.270230] [] SYSC_bind+0x92/0xf0
    [ 1317.270230] [] ? context_tracking_user_exit+0xb8/0x1d0
    [ 1317.270230] [] ? trace_hardirqs_on+0xd/0x10
    [ 1317.270230] [] ? syscall_trace_enter+0x32/0x290
    [ 1317.270230] [] SyS_bind+0xe/0x10
    [ 1317.270230] [] tracesys+0xdd/0xe2
    [ 1317.270230] Code: 00 8b 45 cc 48 8d 75 d0 48 c7 45 d8 00 00 00 00 66 c7 45 d0 02 00 89 45 d4 48 89 df e8 78 49 76 ff 41 89 c4 85 c0 75 0c 48 8b 03 b8 74 09 00 00 01 74 06 41 bc 9d ff ff ff f6 05 2a b6 c2 02
    [ 1317.270230] RIP [] rds_ib_laddr_check+0x82/0x110
    [ 1317.270230] RSP
    [ 1317.270230] CR2: 0000000000000974

    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
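
    The failure mode above is a dereference through a device pointer that can
    be NULL even when binding reports success. A minimal sketch of the guard,
    using hypothetical types (struct mock_device, struct mock_cm_id) and a
    made-up MOCK_NODE_IB_CA constant rather than the real RDMA structures:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_NODE_IB_CA 1   /* hypothetical node type constant */

struct mock_device {
    int node_type;
};

struct mock_cm_id {
    struct mock_device *device;   /* may be NULL after a successful bind */
};

/* Returns 0 if the address is usable, -EADDRNOTAVAIL otherwise. */
static int mock_laddr_check(int bind_ret, const struct mock_cm_id *cm_id)
{
    /* Test the pointer before reading through it. */
    if (bind_ret || !cm_id->device ||
        cm_id->device->node_type != MOCK_NODE_IB_CA)
        return -EADDRNOTAVAIL;
    return 0;
}
```

    The short-circuit evaluation of || guarantees node_type is only read
    when the device pointer is non-NULL.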
     

04 Dec, 2013

1 commit

  • After congestion update on a local connection, when rds_ib_xmit returns
    fewer bytes than there are in the message, rds_send_xmit calls
    back rds_ib_xmit with an offset that causes BUG_ON(off % RDS_FRAG_SIZE)
    to trigger.

    For a 4Kb PAGE_SIZE rds_ib_xmit returns min(8240,4096)=4096 when the
    message actually contains 8240 bytes. rds_send_xmit thinks there is more to send
    and calls rds_ib_xmit again with a data offset "off" of 4096-48(rds header)
    =4048 bytes, thus hitting the BUG_ON(off % RDS_FRAG_SIZE) [RDS_FRAG_SIZE=4k].

    The commit 6094628bfd94323fc1cea05ec2c6affd98c18f7f
    "rds: prevent BUG_ON triggering on congestion map updates" introduced
    this regression. That change was addressing the triggering of a different
    BUG_ON in rds_send_xmit() on PowerPC architecture with 64Kbytes PAGE_SIZE:
        BUG_ON(ret != 0 &&
               conn->c_xmit_sg == rm->data.op_nents);
    This was the sequence it was going through:
    (rds_ib_xmit)
        /* Do not send cong updates to IB loopback */
        if (conn->c_loopback
            && rm->m_inc.i_hdr.h_flags & RDS_FLAG_CONG_BITMAP) {
                rds_cong_map_updated(conn->c_fcong, ~(u64) 0);
                return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES;
        }
    rds_ib_xmit returns 8240
    rds_send_xmit:
        c_xmit_data_off = 0 + 8240 - 48 (rds header accounted only the first time)
                        = 8192
        c_xmit_data_off < 65536 (sg->length), so calls rds_ib_xmit again
    rds_ib_xmit returns 8240
    rds_send_xmit:
        c_xmit_data_off = 8192 + 8240 = 16432, calls rds_ib_xmit again
        and so on (c_xmit_data_off 24672,32912,41152,49392,57632)
    rds_ib_xmit returns 8240
    On this iteration this sequence causes the BUG_ON in rds_send_xmit:
        while (ret) {
            tmp = min_t(int, ret, sg->length - conn->c_xmit_data_off);
            [tmp = 65536 - 57632 = 7904]
            conn->c_xmit_data_off += tmp;
            [c_xmit_data_off = 57632 + 7904 = 65536]
            ret -= tmp;
            [ret = 8240 - 7904 = 336]
            if (conn->c_xmit_data_off == sg->length) {
                conn->c_xmit_data_off = 0;
                sg++;
                conn->c_xmit_sg++;
                BUG_ON(ret != 0 &&
                       conn->c_xmit_sg == rm->data.op_nents);
                [c_xmit_sg = 1, rm->data.op_nents = 1]

    What the current fix does:
    Since the congestion update over loopback is not actually transmitted
    as a message, all that rds_ib_xmit needs to do is let the caller think
    the full message has been transmitted and not return partial bytes.
    It will return 8240 (RDS_CONG_MAP_BYTES+48) when PAGE_SIZE is 4Kb,
    and 64Kb+48 when PAGE_SIZE is 64Kb.

    Reported-by: Josh Hunt
    Tested-by: Honggang Li
    Acked-by: Bang Nguyen
    Signed-off-by: Venkat Venkatsubra
    Signed-off-by: David S. Miller

    Venkat Venkatsubra
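
    The offset arithmetic walked through above can be replayed in a few
    lines. This is a simplified user-space model with hypothetical names
    (calls_until_bug(), offset_after_first_call()); it assumes a single
    scatterlist entry and the 48-byte header quoted in the message.

```c
#include <assert.h>

#define HDR_LEN 48   /* RDS header size quoted in the commit message */

/*
 * Replay the offset accounting for one scatterlist entry of sg_len bytes
 * when every transmit call reports ret_per_call bytes. Returns the number
 * of calls made when the BUG_ON condition is met, or 0 if the message
 * completes cleanly.
 */
static int calls_until_bug(int sg_len, int ret_per_call)
{
    int off = 0, sg = 0, nents = 1, calls = 0, first = 1;

    while (sg < nents) {
        int ret = ret_per_call;

        calls++;
        if (first) {            /* header accounted only the first time */
            ret -= HDR_LEN;
            first = 0;
        }
        while (ret) {
            int room = sg_len - off;
            int tmp = ret < room ? ret : room;

            off += tmp;
            ret -= tmp;
            if (off == sg_len) {
                off = 0;
                sg++;
                if (ret != 0 && sg == nents)
                    return calls;   /* bytes left over but no entries left */
            }
        }
    }
    return 0;
}

/* Data offset passed to the second call after a partial first return. */
static int offset_after_first_call(int returned)
{
    return returned - HDR_LEN;
}
```

    With a 64 KB entry and 8240 bytes reported per call, the model reaches
    the leftover-bytes condition on the eighth call, matching the sequence in
    the message; reporting the full 65536+48 bytes in one call completes
    cleanly. For the 4 KB case, 4096-48 is not a multiple of 4096.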
     

21 Nov, 2013

1 commit


20 Oct, 2013

3 commits

  • Initialize the ehash and ipv6_hash_secrets with net_get_random_once.

    Each compilation unit gets its own secret now:
    ipv4/inet_hashtables.o
    ipv4/udp.o
    ipv6/inet6_hashtables.o
    ipv6/udp.o
    rds/connection.o

    The functions still get inlined into the hashing functions. In the fast
    path we have at most two (needed in ipv6) if (unlikely(...)).

    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
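
    The idea of a per-compilation-unit secret that is initialized lazily on
    first use can be sketched like this. It is a single-threaded user-space
    model with hypothetical names (mock_get_random_once(), mock_hashfn());
    rand() is only a placeholder and is not a secure source, and the sketch
    does not try to reproduce how the kernel helper works internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* One secret per translation unit, mirroring the per-object-file split. */
static uint32_t mock_hash_secret;
static bool mock_secret_ready;
static unsigned int mock_secret_inits;   /* instrumentation for the sketch */

/* Fill *secret the first time only; later calls just test the flag. */
static void mock_get_random_once(uint32_t *secret, bool *done)
{
    if (!*done) {                     /* the unlikely branch on the fast path */
        *secret = (uint32_t)rand();   /* placeholder, not cryptographic */
        *done = true;
        mock_secret_inits++;
    }
}

/* Toy hash that mixes the lazily initialized secret into the result. */
static uint32_t mock_hashfn(uint32_t laddr, uint32_t faddr)
{
    mock_get_random_once(&mock_hash_secret, &mock_secret_ready);
    return ((laddr ^ faddr) * 2654435761u) ^ mock_hash_secret;
}
```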
     
  • This duplicates a bit of code but lets us easily introduce
    separate secret keys later. The separate compilation units are
    ipv4/inet_hashtables.o, ipv4/udp.o and rds/connection.o.

    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • There is a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

13 Jun, 2013

1 commit

  • Reduce the uses of this unnecessary typedef.

    Done via perl script:

    $ git grep --name-only -w ctl_table net | \
    xargs perl -p -i -e '\
    sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
    s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

    Reflow the modified lines that now exceed 80 columns.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

08 Mar, 2013

1 commit

  • For a NUL-terminated string, always make sure there is a '\0' at the end.

    Additional info:
    strncpy pads with zeroes to the end of the given buffer.
    Every byte of memory that is going to be copied to userland should be initialised.

    Signed-off-by: Chen Gang
    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Chen Gang
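
    The two strncpy properties mentioned above, zero padding for short
    sources and no terminator for long ones, can be shown in a short sketch.
    struct mock_info, MOCK_NAME_LEN and mock_fill_info() are hypothetical and
    stand in for a fixed-size field that is later copied to user space.

```c
#include <assert.h>
#include <string.h>

#define MOCK_NAME_LEN 8   /* hypothetical fixed-size name field */

struct mock_info {
    char name[MOCK_NAME_LEN];
};

/* Copy src into the fixed-size field, always leaving it NUL terminated. */
static void mock_fill_info(struct mock_info *info, const char *src)
{
    /* strncpy zero-pads short sources, so no stale bytes remain */
    strncpy(info->name, src, sizeof(info->name) - 1);
    /* it does not terminate when src is too long, so do that by hand */
    info->name[sizeof(info->name) - 1] = '\0';
}
```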
     

06 Mar, 2013

1 commit

  • Pull networking fixes from David Miller:
    "A moderately sized pile of fixes, some specifically for merge window
    introduced regressions although others are for longer standing items
    and have been queued up for -stable.

    I'm kind of tired of all the RDS protocol bugs over the years, to be
    honest, it's way out of proportion to the number of people who
    actually use it.

    1) Fix missing range initialization in netfilter IPSET, from Jozsef
    Kadlecsik.

    2) ieee80211_local->tim_lock needs to use BH disabling, from Johannes
    Berg.

    3) Fix DMA syncing in SFC driver, from Ben Hutchings.

    4) Fix regression in BOND device MAC address setting, from Jiri
    Pirko.

    5) Missing usb_free_urb in ISDN Hisax driver, from Marina Makienko.

    6) Fix UDP checksumming in bnx2x driver for 57710 and 57711 chips,
    fix from Dmitry Kravkov.

    7) Missing cfgspace_lock initialization in BCMA driver.

    8) Validate parameter size for SCTP assoc stats getsockopt(), from
    Guenter Roeck.

    9) Fix SCTP association hangs, from Lee A Roberts.

    10) Fix jumbo frame handling in r8169, from Francois Romieu.

    11) Fix phy_device memory leak, from Petr Malat.

    12) Omit trailing FCS from frames received in BGMAC driver, from Hauke
    Mehrtens.

    13) Missing socket refcount release in L2TP, from Guillaume Nault.

    14) sctp_endpoint_init should respect passed in gfp_t, rather than use
    GFP_KERNEL unconditionally. From Dan Carpenter.

    15) Add ASIX AX88179 USB driver, from Freddy Xin.

    16) Remove MAINTAINERS entries for drivers deleted during the merge
    window, from Cesar Eduardo Barros.

    17) RDS protocol can try to allocate huge amounts of memory, check
    that the user's request length makes sense, from Cong Wang.

    18) SCTP should use the provided KMALLOC_MAX_SIZE instead of its own,
    bogus, definition. From Cong Wang.

    19) Fix deadlocks in FEC driver by moving TX reclaim into NAPI poll,
    from Frank Li. Also, fix a build error introduced in the merge
    window.

    20) Fix bogus purging of default routes in ipv6, from Lorenzo Colitti.

    21) Don't double count RTT measurements when we leave the TCP receive
    fast path, from Neal Cardwell."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
    tcp: fix double-counted receiver RTT when leaving receiver fast path
    CAIF: fix sparse warning for caif_usb
    rds: simplify a warning message
    net: fec: fix build error in no MXC platform
    net: ipv6: Don't purge default router if accept_ra=2
    net: fec: put tx to napi poll function to fix dead lock
    sctp: use KMALLOC_MAX_SIZE instead of its own MAX_KMALLOC_SIZE
    rds: limit the size allocated by rds_message_alloc()
    MAINTAINERS: remove eexpress
    MAINTAINERS: remove drivers/net/wan/cycx*
    MAINTAINERS: remove 3c505
    caif_dev: fix sparse warnings for caif_flow_cb
    ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver
    sctp: use the passed in gfp flags instead GFP_KERNEL
    ipv[4|6]: correct dropwatch false positive in local_deliver_finish
    l2tp: Restore socket refcount when sendmsg succeeds
    net/phy: micrel: Disable asymmetric pause for KSZ9021
    bgmac: omit the fcs
    phy: Fix phy_device_free memory leak
    bnx2x: Fix KR2 work-around condition
    ...

    Linus Torvalds
     

05 Mar, 2013

2 commits

  • Cc: David S. Miller
    Cc: Venkat Venkatsubra
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Dave Jones reported the following bug:

    "When fed mangled socket data, rds will trust what userspace gives it,
    and tries to allocate enormous amounts of memory larger than what
    kmalloc can satisfy."

    WARNING: at mm/page_alloc.c:2393 __alloc_pages_nodemask+0xa0d/0xbe0()
    Hardware name: GA-MA78GM-S2H
    Modules linked in: vmw_vsock_vmci_transport vmw_vmci vsock fuse bnep dlci bridge 8021q garp stp mrp binfmt_misc l2tp_ppp l2tp_core rfcomm s
    Pid: 24652, comm: trinity-child2 Not tainted 3.8.0+ #65
    Call Trace:
    [] warn_slowpath_common+0x75/0xa0
    [] warn_slowpath_null+0x1a/0x20
    [] __alloc_pages_nodemask+0xa0d/0xbe0
    [] ? native_sched_clock+0x26/0x90
    [] ? trace_hardirqs_off_caller+0x28/0xc0
    [] ? trace_hardirqs_off+0xd/0x10
    [] alloc_pages_current+0xb8/0x180
    [] __get_free_pages+0x2a/0x80
    [] kmalloc_order_trace+0x3e/0x1a0
    [] __kmalloc+0x2f5/0x3a0
    [] ? local_bh_enable_ip+0x7c/0xf0
    [] rds_message_alloc+0x23/0xb0 [rds]
    [] rds_sendmsg+0x2b1/0x990 [rds]
    [] ? trace_hardirqs_off+0xd/0x10
    [] sock_sendmsg+0xb0/0xe0
    [] ? get_lock_stats+0x22/0x70
    [] ? put_lock_stats.isra.23+0xe/0x40
    [] sys_sendto+0x130/0x180
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irq+0x3b/0x60
    [] ? sysret_check+0x1b/0x56
    [] ? trace_hardirqs_on_caller+0x115/0x1a0
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace eed6ae990d018c8b ]---

    Reported-by: Dave Jones
    Cc: Dave Jones
    Cc: David S. Miller
    Cc: Venkat Venkatsubra
    Signed-off-by: Cong Wang
    Acked-by: Venkat Venkatsubra
    Signed-off-by: David S. Miller

    Cong Wang
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones. The list iterators take three parameters:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
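
    To make the difference concrete, here is a small userspace sketch of
    both iterator shapes. It is an illustration only: the demo_* types and
    macros are hypothetical stand-ins, not the definitions from
    linux/list.h, and it relies on the GNU __typeof__ extension (gcc or
    clang).

```c
#include <stddef.h>

/* Minimal stand-ins for the hlist types (hypothetical names). */
struct demo_hnode { struct demo_hnode *next; };
struct demo_hhead { struct demo_hnode *first; };

/* Recover the containing object from a pointer to its embedded node. */
#define demo_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same, but map a NULL node to a NULL entry so a loop can stop on it. */
#define demo_entry_or_null(ptr, type, member) \
	((ptr) ? demo_entry(ptr, type, member) : NULL)

/* Old four-parameter shape: 'pos' is a separate node cursor. */
#define demo_for_each_entry_old(tpos, pos, head, member)                    \
	for (pos = (head)->first;                                           \
	     pos && (tpos = demo_entry(pos, __typeof__(*tpos), member), 1); \
	     pos = pos->next)

/* New three-parameter shape: the entry pointer is the only cursor. */
#define demo_for_each_entry(pos, head, member)                              \
	for (pos = demo_entry_or_null((head)->first,                        \
				      __typeof__(*(pos)), member);          \
	     pos;                                                           \
	     pos = demo_entry_or_null((pos)->member.next,                   \
				      __typeof__(*(pos)), member))

struct item {
	int value;
	struct demo_hnode link;
};

static void demo_add_head(struct demo_hhead *h, struct demo_hnode *n)
{
	n->next = h->first;
	h->first = n;
}

static int sum_old_style(struct demo_hhead *h)
{
	struct item *it;
	struct demo_hnode *node;	/* extra cursor the old shape demands */
	int sum = 0;

	demo_for_each_entry_old(it, node, h, link)
		sum += it->value;
	return sum;
}

static int sum_new_style(struct demo_hhead *h)
{
	struct item *it;		/* code that needs the node uses &it->link */
	int sum = 0;

	demo_for_each_entry(it, h, link)
		sum += it->value;
	return sum;
}
```

    Both functions walk the same list and compute the same sum; the new
    shape simply drops the extra local variable at every call site.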

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

12 Jan, 2013

1 commit

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: Venkat Venkatsubra
    CC: "David S. Miller"
    Signed-off-by: Kees Cook
    Acked-by: Venkat Venkatsubra
    Acked-by: David S. Miller

    Kees Cook
     

27 Dec, 2012

2 commits


20 Nov, 2012

1 commit


10 Oct, 2012

1 commit

  • This is the revised patch for fixing rds-ping spinlock recursion
    according to Venkat's suggestions.

    The RDS ping/pong over TCP feature has been broken for years (2.6.39 to
    3.6.0), since we have to set TCP cork and call kernel_sendmsg() between
    ping and pong, and both need to lock "struct sock *sk". However, this
    lock is already held before the rds_tcp_data_ready() callback is
    triggered. As a result, we always hit spinlock recursion, which
    results in a system panic.

    Given that RDS ping is only used to test connectivity and not for
    serious performance measurements, we can queue the pong transmit to
    rds_wq as a delayed response.
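
    The deferral pattern can be sketched in plain userspace C. This is a
    single-threaded model under stated assumptions, not kernel code:
    struct conn, the sk_locked flag, and every function name below are
    hypothetical, and the flag only stands in for the socket lock.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

struct conn {
	bool sk_locked;			/* stands in for the socket lock */
	int pending[MAX_PENDING];	/* ping sequence numbers awaiting a pong */
	size_t npending;
	int pongs_sent;
	int last_pong_seq;
	int recursion_errors;		/* times a send ran with the lock held */
};

/* Sending needs the lock; calling this with the lock held is the bug. */
static void send_pong(struct conn *c, int seq)
{
	if (c->sk_locked) {
		c->recursion_errors++;
		return;
	}
	c->sk_locked = true;
	c->pongs_sent++;
	c->last_pong_seq = seq;
	c->sk_locked = false;
}

/*
 * Data-ready callback: runs with the lock already held, so it only
 * records the work instead of sending. Returns false if the queue is full.
 */
static bool data_ready(struct conn *c, int ping_seq)
{
	if (c->npending == MAX_PENDING)
		return false;
	c->pending[c->npending++] = ping_seq;
	return true;
}

/* Receive path: takes the lock, invokes the callback, releases the lock. */
static bool receive_ping(struct conn *c, int seq)
{
	bool queued;

	c->sk_locked = true;
	queued = data_ready(c, seq);
	c->sk_locked = false;
	return queued;
}

/* Worker: runs later, without the lock held, and drains the queue. */
static void run_worker(struct conn *c)
{
	size_t i;

	for (i = 0; i < c->npending; i++)
		send_pong(c, c->pending[i]);
	c->npending = 0;
}
```

    The pong goes out slightly later than it would from the callback,
    which is acceptable here because ping is a connectivity check rather
    than a latency benchmark.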

    Reported-by: Dan Carpenter
    CC: Venkat Venkatsubra
    CC: David S. Miller
    CC: James Morris
    Signed-off-by: Jie Liu
    Signed-off-by: David S. Miller

    jeff.liu
     

23 Aug, 2012

1 commit


23 Jul, 2012

1 commit

  • Jay Fenlason (fenlason@redhat.com) found a bug:
    recvfrom() on an RDS socket can return the contents of random kernel
    memory to userspace if it is called with an address length larger than
    sizeof(struct sockaddr_in).
    rds_recvmsg() also fails to set the addr_len parameter properly before
    returning, but that's just a bug.
    There are also a number of cases where recvfrom() can return an entirely bogus
    address. Anything in rds_recvmsg() that returns a non-negative value but does
    not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
    at the end of the while(1) loop will return up to 128 bytes of kernel memory
    to userspace.

    I wrote two test programs to reproduce this bug. In rds_server, the kernel
    writes past fromAddr and clobbers the sock_fd that follows it on the
    stack.
    Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is
    better to make the kernel copy the real length of the address to user space in
    such a case.
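
    The intended behaviour, copying no more than the real address length
    and reporting that length back, can be sketched in userspace as
    follows. copy_peer_name() is a hypothetical helper written for this
    illustration; it is not the code in rds_recvmsg().

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Fill in the caller's name buffer with the peer address.
 * The copy is clamped to sizeof(struct sockaddr_in) regardless of how
 * large the caller claims the buffer is, and msg_namelen is updated to
 * the number of bytes actually written.
 */
static void copy_peer_name(struct msghdr *msg, const struct sockaddr_in *peer)
{
	socklen_t len = msg->msg_namelen;

	if (msg->msg_name == NULL) {
		msg->msg_namelen = 0;
		return;
	}
	if (len > sizeof(*peer))
		len = sizeof(*peer);

	memcpy(msg->msg_name, peer, len);
	msg->msg_namelen = len;
}
```

    With a clamp like this, a caller that passes sizeof(fromAddr) + 16 gets
    back a 16-byte address and msg_namelen of 16, and whatever sits after
    the buffer is left untouched.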

    How to run the test programs?
    I tested them on a 32-bit x86 system running 3.5.0-rc7.

    1. Compile:
    gcc -o rds_client rds_client.c
    gcc -o rds_server rds_server.c

    2. Run ./rds_server on one console.

    3. Run ./rds_client on another console.

    4. You will see something like:
    server is waiting to receive data...
    old socket fd=3
    server received data from client:data from client
    msg.msg_namelen=32
    new socket fd=-1067277685
    sendmsg()
    : Bad file descriptor

    /***************** rds_client.c ********************/

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int sock_fd;
        struct sockaddr_in serverAddr;  /* local address this client binds to */
        struct sockaddr_in toAddr;      /* address of rds_server */
        char recvBuffer[128] = "data from client";
        struct msghdr msg;
        struct iovec iov;

        sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
        if (sock_fd < 0) {
            perror("create socket error\n");
            exit(1);
        }

        /* An RDS socket must be bound before it can send. */
        memset(&serverAddr, 0, sizeof(serverAddr));
        serverAddr.sin_family = AF_INET;
        serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
        serverAddr.sin_port = htons(4001);

        if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
            perror("bind() error\n");
            close(sock_fd);
            exit(1);
        }

        /* Send one message to the server on port 4000. */
        memset(&toAddr, 0, sizeof(toAddr));
        toAddr.sin_family = AF_INET;
        toAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
        toAddr.sin_port = htons(4000);
        msg.msg_name = &toAddr;
        msg.msg_namelen = sizeof(toAddr);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_iov->iov_base = recvBuffer;
        msg.msg_iov->iov_len = strlen(recvBuffer) + 1;
        msg.msg_control = 0;
        msg.msg_controllen = 0;
        msg.msg_flags = 0;

        if (sendmsg(sock_fd, &msg, 0) == -1) {
            perror("sendto() error\n");
            close(sock_fd);
            exit(1);
        }

        printf("client send data:%s\n", recvBuffer);

        /* Reuse the buffer to receive the server's reply. */
        memset(recvBuffer, '\0', 128);

        msg.msg_name = &toAddr;
        msg.msg_namelen = sizeof(toAddr);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_iov->iov_base = recvBuffer;
        msg.msg_iov->iov_len = 128;
        msg.msg_control = 0;
        msg.msg_controllen = 0;
        msg.msg_flags = 0;
        if (recvmsg(sock_fd, &msg, 0) == -1) {
            perror("recvmsg() error\n");
            close(sock_fd);
            exit(1);
        }

        printf("receive data from server:%s\n", recvBuffer);

        close(sock_fd);

        return 0;
    }

    /***************** rds_server.c ********************/

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        /* Declaration order matters: sock_fd follows fromAddr. */
        struct sockaddr_in fromAddr;
        int sock_fd;
        struct sockaddr_in serverAddr;
        unsigned int addrLen;
        char recvBuffer[128];
        struct msghdr msg;
        struct iovec iov;

        sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
        if (sock_fd < 0) {
            perror("create socket error\n");
            exit(1);
        }

        memset(&serverAddr, 0, sizeof(serverAddr));
        serverAddr.sin_family = AF_INET;
        serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
        serverAddr.sin_port = htons(4000);
        if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
            perror("bind error\n");
            close(sock_fd);
            exit(1);
        }

        printf("server is waiting to receive data...\n");
        msg.msg_name = &fromAddr;

        /*
         * Deliberately wrong length: sizeof(fromAddr) + 16, i.e. 32.
         * Note where fromAddr is declared: recvmsg() will overwrite
         * sock_fd, since the kernel copies 32 bytes to userspace.
         *
         * If you just use sizeof(fromAddr), it works fine.
         */
        msg.msg_namelen = sizeof(fromAddr) + 16;
        /* msg.msg_namelen = sizeof(fromAddr); */
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_iov->iov_base = recvBuffer;
        msg.msg_iov->iov_len = 128;
        msg.msg_control = 0;
        msg.msg_controllen = 0;
        msg.msg_flags = 0;

        while (1) {
            printf("old socket fd=%d\n", sock_fd);
            if (recvmsg(sock_fd, &msg, 0) == -1) {
                perror("recvmsg() error\n");
                close(sock_fd);
                exit(1);
            }
            printf("server received data from client:%s\n", recvBuffer);
            printf("msg.msg_namelen=%d\n", msg.msg_namelen);
            /* On an affected kernel, sock_fd is now garbage. */
            printf("new socket fd=%d\n", sock_fd);
            strcat(recvBuffer, "--data from server");
            if (sendmsg(sock_fd, &msg, 0) == -1) {
                perror("sendmsg()\n");
                close(sock_fd);
                exit(1);
            }
        }

        close(sock_fd);
        return 0;
    }

    Signed-off-by: Weiping Pan
    Signed-off-by: David S. Miller

    Weiping Pan