04 Nov, 2018

1 commit

  • [ Upstream commit 89ab066d4229acd32e323f1569833302544a4186 ]

    This reverts commit dd979b4df817e9976f18fb6f9d134d6bc4a3c317.

    This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
    internal TCP socket for the initial handshake with the remote peer.
    Whenever the SMC connection can not be established this TCP socket is
    used as a fallback. All socket operations on the SMC socket are then
    forwarded to the TCP socket. In case of poll, the file->private_data
    pointer references the SMC socket because the TCP socket has no file
    assigned. This causes tcp_poll to wait on the wrong socket.

    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Karsten Graul
     

31 Jul, 2018

1 commit


29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jun, 2018

1 commit

  • ATM accounts for in-flight TX packets in sk_wmem_alloc of the VCC on
    which they are to be sent. But it doesn't take ownership of those
    packets from the sock (if any) which originally owned them. They should
    remain owned by their actual sender until they've left the box.

    There's a hack in pskb_expand_head() to avoid adjusting skb->truesize
    for certain skbs, precisely to avoid messing up sk_wmem_alloc
    accounting. Ideally that hack would cover the ATM use case too, but it
    doesn't — skbs which aren't owned by any sock, for example PPP control
    frames, still get their truesize adjusted when the low-level ATM driver
    adds headroom.

    This has always been an issue, it seems. The truesize of a packet
    increases, and sk_wmem_alloc on the VCC goes negative. But this wasn't
    for normal traffic, only for control frames. So I think we just got away
    with it, and we probably needed to send 2GiB of LCP echo frames before
    the misaccounting would ever have caused a problem and caused
    atm_may_send() to start refusing packets.

    Commit 14afee4b609 ("net: convert sock.sk_wmem_alloc from atomic_t to
    refcount_t") did exactly what it was intended to do, and turned this
    mostly-theoretical problem into a real one, causing PPPoATM to fail
    immediately as sk_wmem_alloc underflows and atm_may_send() *immediately*
    starts refusing to allow new packets.

    The least intrusive solution to this problem is to stash the value of
    skb->truesize that was accounted to the VCC, in a new member of the
    ATM_SKB(skb) structure. Then in atm_pop_raw() subtract precisely that
    value instead of the then-current value of skb->truesize.

    Fixes: 158f323b9868 ("net: adjust skb->truesize in pskb_expand_head()")
    Signed-off-by: David Woodhouse
    Tested-by: Kevin Darbyshire-Bryant
    Signed-off-by: David S. Miller

    David Woodhouse
     

26 May, 2018

1 commit


12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But they keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

    2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

    3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

    4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

    5) Add eBPF based queue selection to tun, from Jason Wang.

    6) Lockless qdisc support, from John Fastabend.

    7) SCTP stream interleave support, from Xin Long.

    8) Smoother TCP receive autotuning, from Eric Dumazet.

    9) Lots of erspan tunneling enhancements, from William Tu.

    10) Add true function call support to BPF, from Alexei Starovoitov.

    11) Add explicit support for GRO HW offloading, from Michael Chan.

    12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

    13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

    14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

    15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

    16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

    17) Add resource abstration to devlink, from Arkadi Sharshevsky.

    18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

    19) Avoid locking in act_csum, from Davide Caratti.

    20) devinet_ioctl() simplifications from Al viro.

    21) More TCP bpf improvements from Lawrence Brakmo.

    22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
    tls: Add support for encryption using async offload accelerator
    ip6mr: fix stale iterator
    net/sched: kconfig: Remove blank help texts
    openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
    tcp_nv: fix potential integer overflow in tcpnv_acked
    r8169: fix RTL8168EP take too long to complete driver initialization.
    qmi_wwan: Add support for Quectel EP06
    rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
    ipmr: Fix ptrdiff_t print formatting
    ibmvnic: Wait for device response when changing MAC
    qlcnic: fix deadlock bug
    tcp: release sk_frag.page in tcp_disconnect
    ipv4: Get the address of interface correctly.
    net_sched: gen_estimator: fix lockdep splat
    net: macb: Handle HRESP error
    net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
    ipv6: addrconf: break critical section in addrconf_verify_rtnl()
    ipv6: change route cache aging logic
    i40e/i40evf: Update DESC_NEEDED value to reflect larger value
    bnxt_en: cleanup DIM work on device shutdown
    ...

    Linus Torvalds
     

30 Nov, 2017

1 commit

  • net/atm/mpoa_* files use 'struct timeval' to store event
    timestamps. struct timeval uses a 32-bit seconds field which will
    overflow in the year 2038 and beyond. Morever, the timestamps are being
    compared only to get seconds elapsed, so struct timeval which stores
    a seconds and microseconds field is an overkill. This patch replaces
    the use of struct timeval with time64_t to store a 64-bit seconds field.

    Signed-off-by: Tina Ruchandani
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Tina Ruchandani
     

28 Nov, 2017

1 commit


01 Jul, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

14 Mar, 2017

1 commit

  • Andrey reported this kernel warning:

    WARNING: CPU: 0 PID: 4114 at kernel/sched/core.c:7737 __might_sleep+0x149/0x1a0
    do not call blocking ops when !TASK_RUNNING; state=1 set at
    [] prepare_to_wait+0x182/0x530

    The deeply nested alloc_skb is a problem.

    Diagnosis: nesting is wrong. It makes zero sense. Fix it and the
    implicit task state change problem automagically goes away.

    alloc_skb() does not need to be in the "while" loop.

    alloc_skb() does not need to be in the {prepare_to_wait/add_wait_queue ...
    finish_wait/remove_wait_queue} block.

    I claim that:
    - alloc_tx() should only perform the "wait_for_decent_tx_drain" part
    - alloc_skb() ought to be done directly in vcc_sendmsg
    - alloc_skb() failure can be handled gracefully in vcc_sendmsg
    - alloc_skb() may use a (m->msg_flags & MSG_DONTWAIT) dependent
    GFP_{KERNEL / ATOMIC} flag

    Reported-by: Andrey Konovalov
    Reviewed-and-Tested-by: Chas Williams
    Signed-off-by: Chas Williams
    Signed-off-by: David S. Miller

    Francois Romieu
     

02 Mar, 2017

1 commit


06 Dec, 2016

1 commit

  • copy_from_iter_full(), copy_from_iter_full_nocache() and
    csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
    et.al., advancing iterator only in case of successful full copy
    and returning whether it had been successful or not.

    Convert some obvious users. *NOTE* - do not blindly assume that
    something is a good candidate for those unless you are sure that
    not advancing iov_iter in failure case is the right thing in
    this case. Anything that does short read/short write kind of
    stuff (or is in a loop, etc.) is unlikely to be a good one.

    Signed-off-by: Al Viro

    Al Viro
     

01 Dec, 2015

1 commit

  • The memory barrier in the helper wq_has_sleeper is needed by just
    about every user of waitqueue_active. This patch generalises it
    by making it take a wait_queue_head_t directly. The existing
    helper is renamed to skwq_has_sleeper.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

11 May, 2015

1 commit


03 Mar, 2015

1 commit

  • After TIPC doesn't depend on iocb argument in its internal
    implementations of sendmsg() and recvmsg() hooks defined in proto
    structure, no any user is using iocb argument in them at all now.
    Then we can drop the redundant iocb argument completely from kinds of
    implementations of both sendmsg() and recvmsg() in the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

10 Dec, 2014

1 commit

  • Note that the code _using_ ->msg_iter at that point will be very
    unhappy with anything other than unshifted iovec-backed iov_iter.
    We still need to convert users to proper primitives.

    Signed-off-by: Al Viro

    Al Viro
     

24 Nov, 2014

1 commit

  • ... and make it handle multi-segment iovecs - deals with that
    "fix this later" issue for free. A bit of shame, really - it
    had been there since 2.3.15pre3 when the whole thing went into the
    tree, practically a historical artefact by now...

    Signed-off-by: Al Viro

    Al Viro
     

06 Nov, 2014

1 commit

  • This encapsulates all of the skb_copy_datagram_iovec() callers
    with call argument signature "skb, offset, msghdr->msg_iov, length".

    When we move to iov_iters in the networking, the iov_iter object will
    sit in the msghdr.

    Having a helper like this means there will be less places to touch
    during that transformation.

    Based upon descriptions and patch from Al Viro.

    Signed-off-by: David S. Miller

    David S. Miller
     

11 Sep, 2014

1 commit


21 Nov, 2013

1 commit


08 Apr, 2013

1 commit

  • The current code does not fill the msg_name member in case it is set.
    It also does not set the msg_namelen member to 0 and therefore makes
    net/socket.c leak the local, uninitialized sockaddr_storage variable
    to userland -- 128 bytes of kernel stack memory.

    Fix that by simply setting msg_namelen to 0 as obviously nobody cared
    about vcc_recvmsg() not filling the msg_name in case it was set.

    Signed-off-by: Mathias Krause
    Signed-off-by: David S. Miller

    Mathias Krause
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

02 Dec, 2012

1 commit

  • The immediate use case for this is that it will allow us to ensure that a
    pppoatm queue is woken after it has to drop a packet due to the sock being
    locked.

    Note that 'release_cb' is called when the socket is *unlocked*. This is
    not to be confused with vcc_release() — which probably ought to be called
    vcc_close().

    Signed-off-by: David Woodhouse
    Acked-by: Krzysztof Mazur

    David Woodhouse
     

28 Nov, 2012

1 commit

  • The atm is using atmvcc->push(vcc, NULL) callback to notify protocol
    that vcc will be closed and protocol must detach from it. This callback
    is usually used by protocol to decrement module usage count by module_put(),
    but it leaves small window then module is still used after module_put().

    Now the owner of push() callback is kept in atmvcc and
    module_put(atmvcc->owner) is called after the protocol is detached from vcc.

    Signed-off-by: Krzysztof Mazur
    Signed-off-by: David Woodhouse
    Acked-by: Chas Williams

    Krzysztof Mazur
     

16 Aug, 2012

1 commit


23 Nov, 2011

2 commits

  • Now that the vcc backends do the right thing with respect the receive
    queue on registration, allow MSK_PEEK for atm sockets.

    This allows a userspace program to inspect the packets and decide what
    backend to use to handle them.

    Signed-off-by: Jorge Boncompte [DTI2]
    Signed-off-by: David S. Miller

    Jorge Boncompte [DTI2]
     
  • This function moves the implementation found in the clip and br2684
    modules to common code, correctly unlinks the skb from the queue
    before pushing it and makes pppoatm use it.

    Signed-off-by: Jorge Boncompte [DTI2]
    Signed-off-by: David S. Miller

    Jorge Boncompte [DTI2]
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

31 Mar, 2011

1 commit

  • Don't flap VCs when carrier state changes; higher-level protocols
    can detect loss of connectivity and act accordingly. This is more
    consistent with how other network interfaces work.

    We no longer use release_vccs() so we can delete it.

    release_vccs() was duplicated from net/atm/common.c; make the
    corresponding function exported, since other code duplicates it
    and could leverage it if it were public.

    Signed-off-by: Philip A. Prindeville
    Signed-off-by: David S. Miller

    Philip A. Prindeville
     

17 Aug, 2010

1 commit

  • Outdent the code following an if.

    The semantic match that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @r disable braces4@
    position p1,p2;
    statement S1,S2;
    @@

    (
    if (...) { ... }
    |
    if (...) S1@p1 S2@p2
    )

    @script:python@
    p1 << r.p1;
    p2 << r.p2;
    @@

    if (p1[0].column == p2[0].column):
    cocci.print_main("branch",p1)
    cocci.print_secs("after",p2)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: David S. Miller

    Julia Lawall
     

09 Jul, 2010

1 commit

  • Add notifier chain for changes in atm_dev.

    Clients like br2684 will call register_atmdevice_notifier() to be notified of
    changes. Drivers will call atm_dev_signal_change() to notify clients like
    br2684 of the change.

    On DSL and ATM devices it's usefull to have a know if you have a carrier
    signal. netdevice LOWER_UP changes can be propagated to userspace via netlink
    monitor.

    Signed-off-by: Karl Hiramoto
    Signed-off-by: David S. Miller

    Karl Hiramoto
     

02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
    socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
    a rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to eventually wakeup tasks.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
    instead of dynamically allocated ones. They dont need rcu freeing.

    Some cleanups or followups are probably needed, (possible
    sk_callback_lock conversion to a spinlock for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
    return sk->sk_sleep;
    }

    Change all read occurrences of sk_sleep by a call to this function.

    Needed for a future RCU conversion. sk_sleep wont be a field directly
    available.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

27 Jan, 2010

2 commits


13 Oct, 2009

1 commit

  • Create a new socket level option to report number of queue overflows

    Recently I augmented the AF_PACKET protocol to report the number of frames lost
    on the socket receive queue between any two enqueued frames. This value was
    exported via a SOL_PACKET level cmsg. AFter I completed that work it was
    requested that this feature be generalized so that any datagram oriented socket
    could make use of this option. As such I've created this patch, It creates a
    new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
    SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
    overflowed between any two given frames. It also augments the AF_PACKET
    protocol to take advantage of this new feature (as it previously did not touch
    sk->sk_drops, which this patch uses to record the overflow count). Tested
    successfully by me.

    Notes:

    1) Unlike my previous patch, this patch simply records the sk_drops value, which
    is not a number of drops between packets, but rather a total number of drops.
    Deltas must be computed in user space.

    2) While this patch currently works with datagram oriented protocols, it will
    also be accepted by non-datagram oriented protocols. I'm not sure if thats
    agreeable to everyone, but my argument in favor of doing so is that, for those
    protocols which aren't applicable to this option, sk_drops will always be zero,
    and reporting no drops on a receive queue that isn't used for those
    non-participating protocols seems reasonable to me. This also saves us having
    to code in a per-protocol opt in mechanism.

    3) This applies cleanly to net-next assuming that commit
    977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted

    Signed-off-by: Neil Horman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jul, 2009

1 commit

  • Adding memory barrier after the poll_wait function, paired with
    receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
    to wrap the memory barrier.

    Without the memory barrier, following race can happen.
    The race fires, when following code paths meet, and the tp->rcv_nxt
    and __add_wait_queue updates stay in CPU caches.

    CPU1 CPU2

    sys_select receive packet
    ... ...
    __add_wait_queue update tp->rcv_nxt
    ... ...
    tp->rcv_nxt check sock_def_readable
    ... {
    schedule ...
    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
    wake_up_interruptible(sk->sk_sleep)
    ...
    }

    If there was no cache the code would work ok, since the wait_queue and
    rcv_nxt are opposit to each other.

    Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
    passed the tp->rcv_nxt check and sleeps, or will get the new value for
    tp->rcv_nxt and will return with new data mask.
    In both cases the process (CPU1) is being added to the wait queue, so the
    waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay in its
    cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
    endup calling schedule and sleep forever if there are no more data on the
    socket.

    Calls to poll_wait in following modules were ommited:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa