16 Apr, 2014

1 commit

  • ip_queue_xmit() assumes the skb it has to transmit is attached to an
    inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
    changed l2tp to not change skb ownership and thus broke this assumption.

    One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
    so that we do not assume skb->sk points to the socket used by l2tp
    tunnel.

    Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
    Reported-by: Zhan Jianyu
    Tested-by: Zhan Jianyu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->s_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    to freed up memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback actually uses
    the length argument. And since nobody actually cared about it's
    value, lots of call sites pass arbitrary values in such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Feb, 2014

1 commit

  • dccp tfrc: revert

    This reverts 6aee49c558de ("dccp: make local variable static") since
    the variable tfrc_debug is referenced by the tfrc_pr_debug(fmt, ...)
    macro when TFRC debugging is enabled. If it is enabled, use of the
    macro produces a compilation error.

    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Gerrit Renker
     

14 Jan, 2014

1 commit

  • This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
    to be honored by protocols which do more stringent validation on the
    ICMP's packet payload. This knob is useful for people who e.g. want to
    run an unmodified DNS server in a namespace where they need to use pmtu
    for TCP connections (as they are used for zone transfers or fallback
    for requests) but don't want to use possibly spoofed UDP pmtu information.

    Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
    if the returned packet is in the window or if the association is valid.

    Cc: Eric Dumazet
    Cc: David Miller
    Cc: John Heffner
    Suggested-by: Florian Weimer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

07 Jan, 2014

1 commit


05 Jan, 2014

2 commits


20 Dec, 2013

2 commits

  • Check the return value of request_module during dccp_probe initialisation,
    bail out if that call fails.

    Signed-off-by: Gerrit Renker
    Signed-off-by: Wang Weidong
    Signed-off-by: David S. Miller

    Wang Weidong
     
  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2013-12-19

    1) Use the user supplied policy index instead of a generated one
    if present. From Fan Du.

    2) Make xfrm migration namespace aware. From Fan Du.

    3) Make the xfrm state and policy locks namespace aware. From Fan Du.

    4) Remove ancient sleeping when the SA is in acquire state,
    we now queue packets to the policy instead. This replaces the
    sleeping code.

    5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the
    posibility to sleep. The sleeping code is gone, so remove it.

    6) Check user specified spi for IPComp. Thr spi for IPcomp is only
    16 bit wide, so check for a valid value. From Fan Du.

    7) Export verify_userspi_info to check for valid user supplied spi ranges
    with pfkey and netlink. From Fan Du.

    8) RFC3173 states that if the total size of a compressed payload and the IPComp
    header is not smaller than the size of the original payload, the IP datagram
    must be sent in the original non-compressed form. These packets are dropped
    by the inbound policy check because they are not transformed. Document the need
    to set 'level use' for IPcomp to receive such packets anyway. From Fan Du.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

19 Dec, 2013

1 commit

  • IPV6_PMTU_INTERFACE is the same as IPV6_PMTU_PROBE for ipv6. Add it
    nontheless for symmetry with IPv4 sockets. Also drop incoming MTU
    information if this mode is enabled.

    The additional bit in ipv6_pinfo just eats in the padding behind the
    bitfield. There are no changes to the layout of the struct at all.

    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

11 Dec, 2013

1 commit

  • This patch is following b579035ff766c9412e2b92abf5cab794bff102b6
    "ipv6: remove old conditions on flow label sharing"

    Since there is no reason to restrict a label to a
    destination, we should not erase the destination value of a
    socket with the value contained in the flow label storage.

    This patch allows to really have the same flow label to more
    than one destination.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     

06 Dec, 2013

1 commit


06 Nov, 2013

1 commit

  • Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
    their sockets won't accept and install new path mtu information and they
    will always use the interface mtu for outgoing packets. It is guaranteed
    that the packet is not fragmented locally. But we won't set the DF-Flag
    on the outgoing frames.

    Florian Weimer had the idea to use this flag to ensure DNS servers are
    never generating outgoing fragments. They may well be fragmented on the
    path, but the server never stores or usees path mtu values, which could
    well be forged in an attack.

    (The root of the problem with path MTU discovery is that there is
    no reliable way to authenticate ICMP Fragmentation Needed But DF Set
    messages because they are sent from intermediate routers with their
    source addresses, and the IMCP payload will not always contain sufficient
    information to identify a flow.)

    Recent research in the DNS community showed that it is possible to
    implement an attack where DNS cache poisoning is feasible by spoofing
    fragments. This work was done by Amir Herzberg and Haya Shulman:

    This issue was previously discussed among the DNS community, e.g.
    ,
    without leading to fixes.

    This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
    regarding local fragmentation with UFO/CORK" for the enforcement of the
    non-fragmentable checks. If other users than ip_append_page/data should
    use this semantic too, we have to add a new flag to IPCB(skb)->flags to
    suppress local fragmentation and check for this in ip_finish_output.

    Many thanks to Florian Weimer for the idea and feedback while implementing
    this patch.

    Cc: David S. Miller
    Suggested-by: Florian Weimer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

20 Oct, 2013

1 commit

  • There are a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

11 Oct, 2013

1 commit

  • In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
    I forgot that the two ports in sock_common do not have same byte order :

    skc_dport is __be16 (network order), but skc_num is __u16 (host order)

    So sparse complains because ir_loc_port (mapped into skc_num) is
    considered as __u16 while it should be __be16

    Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
    and perform appropriate htons/ntohs conversions.

    Signed-off-by: Eric Dumazet
    Reported-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2013

1 commit

  • TCP listener refactoring, part 5 :

    We want to be able to insert request sockets (SYN_RECV) into main
    ehash table instead of the per listener hash table to allow RCU
    lookups and remove listener lock contention.

    This patch includes the needed struct sock_common in front
    of struct request_sock

    This means there is no more inet6_request_sock IPv6 specific
    structure.

    Following inet_request_sock fields were renamed as they became
    macros to reference fields from struct sock_common.
    Prefix ir_ was chosen to avoid name collisions.

    loc_port -> ir_loc_port
    loc_addr -> ir_loc_addr
    rmt_addr -> ir_rmt_addr
    rmt_port -> ir_rmt_port
    iif -> ir_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Oct, 2013

2 commits

  • TCP listener refactoring, part 4 :

    To speed up inet lookups, we moved IPv4 addresses from inet to struct
    sock_common

    Now is time to do the same for IPv6, because it permits us to have fast
    lookups for all kind of sockets, including upcoming SYN_RECV.

    Getting IPv6 addresses in TCP lookups currently requires two extra cache
    lines, plus a dereference (and memory stall).

    inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

    This patch is way bigger than its IPv4 counter part, because for IPv4,
    we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
    it's not doable easily.

    inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
    inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

    And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
    at the same offset.

    We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
    macro.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TCP listener refactoring, part 3 :

    Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
    and parallel SYN processing.

    Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
    friend states) sockets, another for TIME_WAIT sockets only.

    As the hash table is sized to get at most one socket per bucket, it
    makes little sense to have separate twchain, as it makes the lookup
    slightly more complicated, and doubles hash table memory usage.

    If we make sure all socket types have the lookup keys at the same
    offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
    and ESTABLISHED sockets already have common lookup fields for IPv4.

    [ INET_TW_MATCH() is no longer needed ]

    I'll provide a follow-up to factorize IPv6 lookup as well, to remove
    INET6_TW_MATCH()

    This way, SYN_RECV pseudo sockets will be supported the same.

    A new sock_gen_put() helper is added, doing either a sock_put() or
    inet_twsk_put() [ and will support SYN_RECV later ].

    Note this helper should only be called in real slow path, when rcu
    lookup found a socket that was moved to another identity (freed/reused
    immediately), but could eventually be used in other contexts, like
    sock_edemux()

    Before patch :

    dmesg | grep "TCP established"

    TCP established hash table entries: 524288 (order: 11, 8388608 bytes)

    After patch :

    TCP established hash table entries: 524288 (order: 10, 4194304 bytes)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Sep, 2013

1 commit


25 Jul, 2013

1 commit

  • Several call sites use the hardcoded following condition :

    sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

    Lets use a helper because TCP_NOTSENT_LOWAT support will change this
    condition for TCP sockets.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2013

1 commit

  • TCPCT uses option-number 253, reserved for experimental use and should
    not be used in production environments.
    Further, TCPCT does not fully implement RFC 6013.

    As a nice side-effect, removing TCPCT increases TCP's performance for
    very short flows:

    Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
    for files of 1KB size.

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

22 Feb, 2013

1 commit

  • Pull driver core patches from Greg Kroah-Hartman:
    "Here is the big driver core merge for 3.9-rc1

    There are two major series here, both of which touch lots of drivers
    all over the kernel, and will cause you some merge conflicts:

    - add a new function called devm_ioremap_resource() to properly be
    able to check return values.

    - remove CONFIG_EXPERIMENTAL

    Other than those patches, there's not much here, some minor fixes and
    updates"

    Fix up trivial conflicts

    * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
    base: memory: fix soft/hard_offline_page permissions
    drivercore: Fix ordering between deferred_probe and exiting initcalls
    backlight: fix class_find_device() arguments
    TTY: mark tty_get_device call with the proper const values
    driver-core: constify data for class_find_device()
    firmware: Ignore abort check when no user-helper is used
    firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
    firmware: Make user-mode helper optional
    firmware: Refactoring for splitting user-mode helper code
    Driver core: treat unregistered bus_types as having no devices
    watchdog: Convert to devm_ioremap_resource()
    thermal: Convert to devm_ioremap_resource()
    spi: Convert to devm_ioremap_resource()
    power: Convert to devm_ioremap_resource()
    mtd: Convert to devm_ioremap_resource()
    mmc: Convert to devm_ioremap_resource()
    mfd: Convert to devm_ioremap_resource()
    media: Convert to devm_ioremap_resource()
    iommu: Convert to devm_ioremap_resource()
    drm: Convert to devm_ioremap_resource()
    ...

    Linus Torvalds
     

19 Feb, 2013

2 commits

  • proc_net_remove is only used to remove proc entries
    that under /proc/net,it's not a general function for
    removing proc entries of netns. if we want to remove
    some proc entries which under /proc/net/stat/, we still
    need to call remove_proc_entry.

    this patch use remove_proc_entry to replace proc_net_remove.
    we can remove proc_net_remove after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     
  • Right now, some modules such as bonding use proc_create
    to create proc entries under /proc/net/, and other modules
    such as ipv4 use proc_net_fops_create.

    It looks a little chaos.this patch changes all of
    proc_net_fops_create to proc_create. we can remove
    proc_net_fops_create after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

12 Jan, 2013

2 commits

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: Gerrit Renker
    CC: "David S. Miller"
    Signed-off-by: Kees Cook
    Acked-by: David S. Miller
    Acked-by: Gerrit Renker

    Kees Cook
     
  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: Gerrit Renker
    CC: "David S. Miller"
    Signed-off-by: Kees Cook
    Acked-by: David S. Miller
    Acked-by: Gerrit Renker

    Kees Cook
     

15 Dec, 2012

1 commit

  • If in either of the above functions inet_csk_route_child_sock() or
    __inet_inherit_port() fails, the newsk will not be freed:

    unreferenced object 0xffff88022e8a92c0 (size 1592):
    comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
    hex dump (first 32 bytes):
    0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
    02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x21/0x3e
    [] kmem_cache_alloc+0xb5/0xc5
    [] sk_prot_alloc.isra.53+0x2b/0xcd
    [] sk_clone_lock+0x16/0x21e
    [] inet_csk_clone_lock+0x10/0x7b
    [] tcp_create_openreq_child+0x21/0x481
    [] tcp_v4_syn_recv_sock+0x3a/0x23b
    [] tcp_check_req+0x29f/0x416
    [] tcp_v4_do_rcv+0x161/0x2bc
    [] tcp_v4_rcv+0x6c9/0x701
    [] ip_local_deliver_finish+0x70/0xc4
    [] ip_local_deliver+0x4e/0x7f
    [] ip_rcv_finish+0x1fc/0x233
    [] ip_rcv+0x217/0x267
    [] __netif_receive_skb+0x49e/0x553
    [] netif_receive_skb+0x50/0x82

    This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
    a single sock_put() is not enough to free the memory. Additionally, things
    like xfrm, memcg, cookie_values,... may have been initialized.
    We have to free them properly.

    This is fixed by forcing a call to tcp_done(), ending up in
    inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
    because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
    xfrm,...

    Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
    force it entering inet_csk_destroy_sock. To avoid the warning in
    inet_csk_destroy_sock, inet_num has to be set to 0.
    As inet_csk_destroy_sock does a dec on orphan_count, we first have to
    increase it.

    Calling tcp_done() allows us to remove the calls to
    tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

    A similar approach is taken for dccp by calling dccp_done().

    This is in the kernel since 093d282321 (tproxy: fix hash locking issue
    when using port redirection in __inet_inherit_port()), thus since
    version >= 2.6.37.

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

04 Nov, 2012

1 commit

  • For passive TCP connections using TCP_DEFER_ACCEPT facility,
    we incorrectly increment req->retrans each time timeout triggers
    while no SYNACK is sent.

    SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
    which we received the ACK from client). Only the last SYNACK is sent
    so that we can receive again an ACK from client, to move the req into
    accept queue. We plan to change this later to avoid the useless
    retransmit (and potential problem as this SYNACK could be lost)

    TCP_INFO later gives wrong information to user, claiming imaginary
    retransmits.

    Decouple req->retrans field into two independent fields :

    num_retrans : number of retransmit
    num_timeout : number of timeouts

    num_timeout is the counter that is incremented at each timeout,
    regardless of actual SYNACK being sent or not, and used to
    compute the exponential timeout.

    Introduce inet_rtx_syn_ack() helper to increment num_retrans
    only if ->rtx_syn_ack() succeeded.

    Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
    when we re-send a SYNACK in answer to a (retransmitted) SYN.
    Prior to this patch, we were not counting these retransmits.

    Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
    only if a synack packet was successfully queued.

    Reported-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Cc: Vijay Subramanian
    Cc: Elliott Hughes
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Aug, 2012

2 commits

  • The CCID3 code fails to initialize the trailing padding bytes of struct
    tfrc_tx_info added for alignment on 64 bit architectures. It that for
    potentially leaks four bytes kernel stack via the getsockopt() syscall.
    Add an explicit memset(0) before filling the structure to avoid the
    info leak.

    Signed-off-by: Mathias Krause
    Cc: Gerrit Renker
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • ccid_hc_rx_getsockopt() and ccid_hc_tx_getsockopt() might be called with
    a NULL ccid pointer leading to a NULL pointer dereference. This could
    lead to a privilege escalation if the attacker is able to map page 0 and
    prepare it with a fake ccid_ops pointer.

    Signed-off-by: Mathias Krause
    Cc: Gerrit Renker
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller

    Mathias Krause
     

24 Jul, 2012

1 commit

  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the ipv4
    specific code, and as well it should since doing the route caching for
    ipv6 is pointless at the moment since it is not inspected in the ipv6
    input paths yet.

    Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.

    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jul, 2012

1 commit


17 Jul, 2012

1 commit

  • This will be used so that we can compose a full flow key.

    Even though we have a route in this context, we need more. In the
    future the routes will be without destination address, source address,
    etc. keying. One ipv4 route will cover entire subnets, etc.

    In this environment we have to have a way to possess persistent storage
    for redirects and PMTU information. This persistent storage will exist
    in the FIB tables, and that's why we'll need to be able to rebuild a
    full lookup flow key here. Using that flow key will do a fib_lookup()
    and create/update the persistent entry.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Jul, 2012

2 commits

  • This is the ipv6 version of inet_csk_update_pmtu().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This abstracts away the call to dst_ops->update_pmtu() so that we can
    transparently handle the fact that, in the future, the dst itself can
    be invalidated by the PMTU update (when we have non-host routes cached
    in sockets).

    So we try to rebuild the socket cached route after the method
    invocation if necessary.

    This isn't used by SCTP because it needs to cache dsts per-transport,
    and thus will need it's own local version of this helper.

    Signed-off-by: David S. Miller

    David S. Miller
     

12 Jul, 2012

3 commits


11 Jul, 2012

1 commit


05 Jul, 2012

1 commit