10 Dec, 2011

1 commit


03 Dec, 2011

1 commit


02 Dec, 2011

1 commit

  • This reverts commit 81d54ec8479a2c695760da81f05b5a9fb2dbe40a.

    If we take the "try_again" goto, due to a checksum error,
    the 'len' has already been truncated. So we won't compute
    the same values as the original code did.

    Reported-by: paul bilke
    Signed-off-by: David S. Miller

    David S. Miller
     

17 Nov, 2011

1 commit


10 Nov, 2011

1 commit

  • On Monday 07 November 2011 at 15:33 +0100, Eric Dumazet wrote:

    > At least, in recent kernels we don't change dst->refcnt in the forwarding
    > path (using NOREF skb->dst)
    >
    > One particular point is the atomic_inc(dst->refcnt) we have to perform
    > when queuing a UDP packet if the socket asked for PKTINFO stuff (for example a
    > typical DNS server has to set up this option)
    >
    > I have one patch somewhere that stores the information in skb->cb[] and
    > avoids the atomic_{inc|dec}(dst->refcnt).
    >

    OK, I found it. I did some extra tests and believe it's ready.

    [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

    When a socket uses IP_PKTINFO notifications, we currently force a dst
    reference for each received skb. Reader has to access dst to get needed
    information (rt_iif & rt_spec_dst) and must release dst reference.

    We also forced a dst reference if skb was put in socket backlog, even
    without IP_PKTINFO handling. This happens under stress/load.

    We can instead store the needed information in skb->cb[], so that only
    the softirq handler actually accesses dst, improving cache hit ratios.

    This removes two atomic operations per packet, and false sharing as
    well.

    On a benchmark using a single-threaded receiver (doing only recvmsg()
    calls), I can reach 720,000 pps instead of 570,000 pps.

    IP_PKTINFO is typically used by DNS servers, and by any multihoming-aware
    UDP application.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
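
The idea in the commit above can be sketched in user-space C: copy the two values the reader needs into the packet's own scratch area while the route is still at hand, so the reader never touches the shared route or its reference count. This is only an illustration under assumptions, not the kernel code; every `toy_` name, the 48-byte scratch size, and the field layout are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a routing entry shared between CPUs. */
struct toy_route {
    int      refcnt;    /* would be an atomic counter in real code */
    int      iif;       /* incoming interface index */
    uint32_t spec_dst;  /* address reported as the local destination */
};

/* Hypothetical per-packet info kept in the packet's scratch area. */
struct toy_pktinfo {
    int      ifindex;
    uint32_t spec_dst;
};

/* Hypothetical packet with a fixed-size control block, in the spirit of skb->cb[]. */
struct toy_packet {
    const struct toy_route *route;
    unsigned char cb[48];
};

/* Receive side: copy what the reader will need, then forget the route
 * without ever taking a reference on it. */
void toy_pktinfo_prepare(struct toy_packet *pkt)
{
    struct toy_pktinfo info = {
        .ifindex  = pkt->route->iif,
        .spec_dst = pkt->route->spec_dst,
    };

    memcpy(pkt->cb, &info, sizeof(info));
    pkt->route = NULL;
}

/* Reader side: only the packet's own memory is touched. */
struct toy_pktinfo toy_pktinfo_read(const struct toy_packet *pkt)
{
    struct toy_pktinfo info;

    memcpy(&info, pkt->cb, sizeof(info));
    return info;
}
```

Because the reader works from a private copy, the route's counter is never incremented or decremented on its behalf, which is where the saved atomic operations and the reduced false sharing would come from.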
     

02 Nov, 2011

2 commits

  • udp_queue_rcv_skb() has a possible race in encap_rcv handling, since
    this pointer can be changed anytime.

    We should use ACCESS_ONCE() to close the race.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The tcp and udp code creates a set of struct file_operations at runtime,
    although this can also be done at compile time, with the added benefit that
    these file operations can then be const.

    The trickiest part was getting the "THIS_MODULE" reference right; the naive
    method of declaring a struct at the place of registration would not work
    for this reason.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
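
For the encap_rcv race described in the first commit of this date, a minimal user-space sketch of the pattern looks like this. The macro has the same shape as the kernel's ACCESS_ONCE() and relies on the GCC/Clang `__typeof__` extension; the `toy_` names and the single-field socket struct are hypothetical, and a real concurrent program would need more than this to be correct.

```c
#include <stddef.h>

/* Same shape as the kernel's ACCESS_ONCE(); needs the GCC/Clang
 * __typeof__ extension. */
#define TOY_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

typedef int (*toy_encap_fn)(int payload);

/* Hypothetical socket with a handler that another thread may change. */
struct toy_sock {
    toy_encap_fn encap_rcv;
};

/* Load the pointer once into a local, then test and call the local copy.
 * Testing sk->encap_rcv and then calling sk->encap_rcv would be two
 * separate loads, and the second one could observe NULL. */
int toy_queue_rcv(struct toy_sock *sk, int payload)
{
    toy_encap_fn encap_rcv = TOY_ACCESS_ONCE(sk->encap_rcv);

    if (encap_rcv != NULL)
        return encap_rcv(payload);
    return -1;  /* no encapsulation handler installed */
}

/* Hypothetical handler used only to exercise the sketch. */
int toy_double(int payload)
{
    return payload * 2;
}
```

The point is that the NULL check and the call are made on the same snapshot of the pointer.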
     

18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

12 Aug, 2011

1 commit


14 Jul, 2011

1 commit


07 Jul, 2011

1 commit

  • Current tcp/udp/sctp global memory limits do not take hugepage
    allocations into account, and allow 50% of RAM to be used by buffers of a
    single protocol [ not counting space used by sockets / inodes ...]

    Let's use nr_free_buffer_pages() and allow a default of 1/8 of kernel RAM
    per protocol, and a minimum of 128 pages.
    Sysadmins of heavy-duty machines probably need to tweak limits anyway.

    References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032
    Reported-by: starlight
    Suggested-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
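
The arithmetic in the commit above is small enough to sketch directly. The helper name is hypothetical, and the page count is passed in as a parameter because nr_free_buffer_pages() only exists inside the kernel.

```c
/* Hypothetical helper: default per-protocol memory limit, in pages. */
unsigned long toy_default_mem_limit(unsigned long free_buffer_pages)
{
    unsigned long limit = free_buffer_pages / 8;   /* 1/8 of kernel RAM */

    if (limit < 128UL)
        limit = 128UL;                             /* floor of 128 pages */
    return limit;
}
```

For example, with 1048576 free buffer pages the limit comes out at 131072 pages, which would be 512 MiB if pages are 4 KiB; a machine reporting only 512 pages would still get the 128-page floor.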
     

06 Jul, 2011

1 commit


22 Jun, 2011

2 commits

  • Consider this scenario: when the first received UDP packet is bigger
    than the receive buffer, the MSG_TRUNC bit is set in msg->msg_flags.
    If a checksum error then occurs on a blocking socket, the code jumps
    back to try_again to receive the next packet. If that next packet is
    smaller than the receive buffer, MSG_TRUNC should not be set, but
    because the bit was never cleared in msg->msg_flags before receiving
    the next packet, it is still set, which is wrong.

    Fix this problem by clearing MSG_TRUNC flag when starting over for a
    new packet.

    Signed-off-by: Xufeng Zhang
    Signed-off-by: Paul Gortmaker
    Signed-off-by: David S. Miller

    Xufeng Zhang
     
  • This patch adds a tracepoint to __udp_queue_rcv_skb to get the
    return value of ip_queue_rcv_skb. It indicates why the kernel drops
    a packet at this point.

    ip_queue_rcv_skb returns the following values in the packet drop case:

    rcvbuf is full : -ENOMEM
    sk_filter returns error : -EINVAL, -EACCES, -ENOMEM, etc.
    __sk_mem_schedule returns error: -ENOBUFS

    Signed-off-by: Satoru Moriya
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Satoru Moriya
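
The MSG_TRUNC fix in the first commit of this date can be illustrated with a toy receive loop. Everything here is a simplified assumption: the datagram queue is an array, the checksum result is a precomputed field, and the `toy_`/`TOY_` names are hypothetical rather than taken from the kernel.

```c
#include <stddef.h>

#define TOY_MSG_TRUNC 0x20   /* illustrative flag bit for "datagram was truncated" */

/* Hypothetical queued datagram: its length and whether its checksum is valid. */
struct toy_dgram {
    size_t len;
    int    csum_ok;
};

/* Returns the number of bytes that would be copied from the first datagram
 * with a valid checksum, or -1 if the queue runs out. */
long toy_recv(const struct toy_dgram *queue, size_t n, size_t buflen,
              int *msg_flags)
{
    size_t i = 0;
    size_t copied;

try_again:
    if (i == n)
        return -1;

    copied = queue[i].len;
    if (copied > buflen) {
        copied = buflen;
        *msg_flags |= TOY_MSG_TRUNC;
    }

    if (!queue[i].csum_ok) {
        /* Bad checksum: drop this datagram and start over. Clearing the
         * flag here is the fix; without it a stale TOY_MSG_TRUNC from the
         * dropped datagram would leak into the result for the next one. */
        *msg_flags &= ~TOY_MSG_TRUNC;
        i++;
        goto try_again;
    }
    return (long)copied;
}
```

With a 50-byte buffer, a corrupt 100-byte datagram followed by a valid 10-byte one yields 10 bytes and no truncation flag, which is the behavior the commit message asks for.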
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK is currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
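
The three kptr_restrict cases described in the commit above reduce to a small decision function. The sketch below models only that decision and the "print as 0's" outcome in user space; the `toy_` names are hypothetical, the capability check is reduced to a plain flag, and the zero-padded hex output is an assumption about formatting rather than the kernel's exact output.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if a %pK-style pointer should be hidden, 0 if it may be shown.
 * kptr_restrict: 0 = behave like %p, 1 = hide unless the reader has
 * CAP_SYSLOG, 2 = always hide. */
int toy_pK_should_hide(int kptr_restrict, int has_cap_syslog)
{
    if (kptr_restrict == 0)
        return 0;
    if (kptr_restrict == 1)
        return !has_cap_syslog;
    return 1;
}

/* Formats the pointer as zero-padded hex, or as all zeros when hidden.
 * Returns the snprintf() result. */
int toy_format_pK(char *buf, size_t size, const void *ptr,
                  int kptr_restrict, int has_cap_syslog)
{
    uintptr_t value = toy_pK_should_hide(kptr_restrict, has_cap_syslog)
                          ? 0 : (uintptr_t)ptr;
    int width = (int)(2 * sizeof(void *));

    return snprintf(buf, size, "%0*" PRIxPTR, width, value);
}
```

Printing zeros rather than a placeholder string keeps the field parseable as a pointer by userland tools, which matches the reasoning given in the commit message.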
     

11 May, 2011

1 commit


09 May, 2011

3 commits


29 Apr, 2011

1 commit

  • We lack proper synchronization to manipulate inet->opt ip_options.

    The problem is that ip_make_skb() calls ip_setup_cork(), and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options)
    without any protection against another thread manipulating inet->opt.

    Another thread can change the inet->opt pointer and free the old one under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy ip_options when
    necessary, to avoid cache line dirtying.

    We can't insert an rcu_head in struct ip_options since it's included in
    skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2011

1 commit


12 Apr, 2011

1 commit


31 Mar, 2011

2 commits


13 Mar, 2011

5 commits


04 Mar, 2011

1 commit

  • As reported by Eric:

    [11483.697233] IP: [] dst_release+0x18/0x60
    ...
    [11483.697741] Call Trace:
    [11483.697764] [] udp_sendmsg+0x282/0x6e0
    [11483.697790] [] ? memcpy_toiovec+0x51/0x70
    [11483.697818] [] ? ip_generic_getfrag+0x0/0xb0

    The pointer passed to dst_release() is -EINVAL, that's because
    we leave an error pointer in the local variable "rt" by accident.

    NULL it out to fix the bug.

    Reported-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
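
The bug class in the commit above, an error code encoded in a pointer reaching a shared cleanup path, can be reproduced in miniature. The helpers below imitate the kernel's error-pointer convention in user space; all `toy_` names are hypothetical, and the route is reduced to a bare counter.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and detect such pointers. */
static inline void *toy_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline int toy_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_route {
    int refcnt;
};

/* Cleanup helper: safe for NULL, but not for an error pointer. */
void toy_route_release(struct toy_route *rt)
{
    if (rt != NULL)
        rt->refcnt--;
}

/* Hypothetical lookup that reports failure as -EINVAL inside the pointer. */
struct toy_route *toy_route_lookup(int ok, struct toy_route *found)
{
    return ok ? found : toy_err_ptr(-EINVAL);
}

int toy_send(int lookup_ok, struct toy_route *found)
{
    struct toy_route *rt = toy_route_lookup(lookup_ok, found);
    int err = 0;

    if (toy_is_err(rt)) {
        err = (int)(intptr_t)rt;
        rt = NULL;   /* the fix: do not leave the error pointer in rt */
        goto out;
    }
    /* ... transmit using rt ... */
out:
    toy_route_release(rt);   /* shared cleanup path */
    return err;
}
```

Without the `rt = NULL` line, the cleanup call would dereference an address derived from -EINVAL, which corresponds to the crash in dst_release() shown in the trace.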
     

03 Mar, 2011

1 commit


02 Mar, 2011

5 commits

  • This boolean state is now available in the flow flags.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • And set it in contexts where the route resolution can sleep.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Since that is what the current vague "flags" argument means.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The UDP transmit path has been running under the socket lock
    for a long time because of the corking feature. This means that
    transmitting to the same socket in multiple threads does not
    scale at all.

    However, as most users don't actually use corking, the locking
    can be removed in the common case.

    This patch creates a lockless fast path where corking is not used.

    Please note that this does create a slight inaccuracy in the
    enforcement of socket send buffer limits. In particular, we
    may exceed the socket limit by up to (number of CPUs) * (packet
    size) because of the way the limit is computed.

    As the primary purpose of socket buffers is to indicate congestion,
    this should not be a great problem for now.

    Signed-off-by: Herbert Xu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch converts UDP to use the new ip_finish_skb API. This
    will then allow us to more easily use ip_make_skb, which allows
    UDP to run without a socket lock.

    Signed-off-by: Herbert Xu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Herbert Xu
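
The send-buffer inaccuracy mentioned in the lockless fast path commit of this date is bounded by a simple product. The helper below just spells out that bound; its name and parameters are hypothetical.

```c
#include <stddef.h>

/* Worst-case number of bytes by which the socket send limit may be
 * exceeded, per the bound stated in the commit message:
 * (number of CPUs) * (packet size). */
size_t toy_max_sndbuf_overshoot(size_t ncpus, size_t packet_size)
{
    return ncpus * packet_size;
}
```

For instance, 8 CPUs each sending a 1500-byte packet at the same moment could exceed the limit by up to 12000 bytes.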
     

25 Jan, 2011

1 commit

  • Quoting Ben Hutchings: we presumably won't be defining features that
    can only be enabled on 64-bit architectures.

    Occurrences found by `grep -r` on net/, drivers/net, include/

    [ Move features and vlan_features next to each other in
    struct netdev, as per Eric Dumazet's suggestion -DaveM ]

    Signed-off-by: Michał Mirosław
    Signed-off-by: David S. Miller

    Michał Mirosław
     

18 Dec, 2010

1 commit


17 Dec, 2010

2 commits

  • Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

    Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

    Signed-off-by: Michał Mirosław
    Signed-off-by: David S. Miller

    Michał Mirosław
     
  • Special care is taken inside sk_prot_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
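
The skb_checksum_start_offset() replacement from the first commit of this date can be pictured with a toy buffer. The struct and `toy_` names are hypothetical; the sketch assumes csum_start is measured from the start of the buffer (head), and uses the identity from the commit message that headroom equals data minus head.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced packet buffer. */
struct toy_skb {
    unsigned char *head;        /* start of the allocated buffer */
    unsigned char *data;        /* start of the packet data */
    size_t         csum_start;  /* assumed: offset from head where checksumming starts */
};

/* Headroom is the gap between the buffer start and the packet data. */
size_t toy_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* One named helper instead of repeating the subtraction at every call site:
 * the checksum start expressed relative to the packet data. */
long toy_checksum_start_offset(const struct toy_skb *skb)
{
    return (long)skb->csum_start - (long)toy_headroom(skb);
}
```

With 32 bytes of headroom and csum_start at 52, checksumming begins 20 bytes into the packet data.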