11 Oct, 2007

2 commits

  • Changes asserts in sunrpc to use sock_owned_by_user() macro instead of
    referencing sock_lock.owner directly.

    Signed-off-by: John Heffner
    Signed-off-by: David S. Miller

    John Heffner
     
  • Hopefully captured all single statement cases under net/. I'm
    not too sure if there is some policy about #includes that are
    "guaranteed" (ie., in the current tree) to be available through
    some other #included header, so I just added linux/kernel.h to
    each changed file that didn't #include it previously.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

21 Sep, 2007

1 commit

  • we upgraded the kernel of a nfs-server from 2.6.17.11 to 2.6.22.6. Since
    then we get the message

    lockd: too many open TCP sockets, consider increasing the number of nfsd threads
    lockd: last TCP connect from ^\\236^\É^D

    These random characters in the second line are caused by a bug in
    svc_tcp_accept.

    (Note: there are two previous __svc_print_addr(sin, buf, sizeof(buf))
    calls in this function, either of which would initialize buf correctly;
    but both are inside "if"'s and are not necessarily executed. This is
    less obvious in the second case, which is inside a dprintk(), which is a
    macro which expands to an if statement.)

    Signed-off-by: Wolfgang Walter
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    Wolfgang Walter
     

15 Sep, 2007

1 commit

  • Commit aaf68cfbf2241d24d46583423f6bff5c47e088b3 added a bias
    to sk_inuse, so this test for an unused socket now fails. So no
    sockets get closed because they are old (they might get closed
    if the client closed them).

    This bug has existed since 2.6.21-rc1.

    Thanks to Wolfgang Walter for finding and reporting the bug.

    Cc: Wolfgang Walter
    Signed-off-by: Neil Brown
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    Neil Brown
     

27 Jul, 2007

1 commit


11 Jul, 2007

1 commit


10 May, 2007

2 commits

  • This while loop has an overly complex condition, which performs a couple of
    assignments. This hurts readability.

    We don't really need a loop at all. We can just return -EAGAIN and (providing
    we set SK_DATA), the function will be called again.

    So discard the loop, make the complex conditional become a few clear function
    calls, and hopefully improve readability.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Now that sk_defer_lock protects two different things, make the name more
    generic.

    Also don't bother with disabling _bh as the lock is only ever taken from
    process context.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

26 Apr, 2007

1 commit

  • We currently use a special structure (struct skb_timeval) and plain
    'struct timeval' to store packet timestamps in sk_buffs and struct
    sock.

    This has some drawbacks :
    - Fixed resolution of micro second.
    - Waste of space on 64bit platforms where sizeof(struct timeval)=16

    I suggest using ktime_t that is a nice abstraction of high resolution
    time services, currently capable of nanosecond resolution.

    As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
    a 8 byte shrink of this structure on 64bit architectures. Some other
    structures also benefit from this size reduction (struct ipq in
    ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

    Once this ktime infrastructure adopted, we can more easily provide
    nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
    SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

    Note : this patch includes a bug correction in
    compat_sock_get_timestamp() where a "err = 0;" was missing (so this
    syscall returned -ENOENT instead of 0)

    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: John find
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2007

1 commit


05 Apr, 2007

1 commit

  • The return value of kernel_recvmsg() should be assigned to "err", not
    compared with the random value of a never initialized "err" (and the "< 0"
    check wrongly always returned false since == comparisons never have a
    result < 0).

    Spotted by the Coverity checker.

    Signed-off-by: Adrian Bunk
    Acked-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

07 Mar, 2007

3 commits

  • When the last thread of nfsd exits, it shuts down all related sockets. It
    currently uses svc_close_socket to do this, but that only is immediately
    effective if the socket is not SK_BUSY.

    If the socket is busy - i.e. if a request has arrived that has not yet been
    processes - svc_close_socket is not effective and the shutdown process spins.

    So create a new svc_force_close_socket which removes the SK_BUSY flag is set
    and then calls svc_close_socket.

    Also change some open-codes loops in svc_destroy to use
    list_for_each_entry_safe.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • They don't really save that much, and aren't worth the hassle.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The sunrpc server code needs to know the source and destination address for
    UDP packets so it can reply properly. It currently copies code out of the
    network stack to pick the pieces out of the skb. This is ugly and causes
    compile problems with the IPv6 stuff.

    So, rip that out and use recv_msg instead. This is a much cleaner interface,
    but has a slight cost in that the checksum is now checked before the copy, so
    we don't benefit from doing both at the same time. This can probably be
    fixed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

13 Feb, 2007

14 commits


11 Feb, 2007

1 commit


10 Feb, 2007

1 commit

  • If you lose this race, it can iput a socket inode twice and you get a BUG
    in fs/inode.c

    When I added the option for user-space to close a socket, I added some
    cruft to svc_delete_socket so that I could call that function when closing
    a socket per user-space request.

    This was the wrong thing to do. I should have just set SK_CLOSE and let
    normal mechanisms do the work.

    Not only wrong, but buggy. The locking is all wrong and it openned up a
    race where-by a socket could be closed twice.

    So this patch:
    Introduces svc_close_socket which sets SK_CLOSE then either leave
    the close up to a thread, or calls svc_delete_socket if it can
    get SK_BUSY.

    Adds a bias to sk_busy which is removed when SK_DEAD is set,
    This avoid races around shutting down the socket.

    Changes several 'spin_lock' to 'spin_lock_bh' where the _bh
    was missing.

    Bugzilla-url: http://bugzilla.kernel.org/show_bug.cgi?id=7916

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

31 Jan, 2007

1 commit


27 Jan, 2007

1 commit

  • NFSd assumes that largest number of pages that will be needed for a
    request+response is 2+N where N pages is the size of the largest permitted
    read/write request. The '2' are 1 for the non-data part of the request, and 1
    for the non-data part of the reply.

    However, when a read request is not page-aligned, and we choose to use
    ->sendfile to send it directly from the page cache, we may need N+1 pages to
    hold the whole reply. This can overflow and array and cause an Oops.

    This patch increases size of the array for holding pages by one and makes sure
    that entry is NULL when it is not in use.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

08 Dec, 2006

2 commits

  • Stick NFS sockets in their own class to avoid some lockdep warnings. NFS
    sockets are never exposed to user-space, and will hence not trigger certain
    code paths that would otherwise pose deadlock scenarios.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Steven Dickson
    Acked-by: Ingo Molnar
    Cc: Trond Myklebust
    Acked-by: Neil Brown
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    [ Fixed patch corruption by quilt, pointed out by Peter Zijlstra ]
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     

31 Oct, 2006

2 commits

  • - printk should remain dprintk

    - fix coding-style.

    Cc: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • A recent patch fixed a problem which would occur when the refcount on an
    auth_domain reached zero. This problem has not been reported in practice
    despite existing in two major kernel releases because the refcount can
    never reach zero.

    This patch fixes the problems that stop the refcount reaching zero.

    1/ We were adding to the refcount when inserting in the hash table,
    but only removing from the hashtable when the refcount reached zero.
    Obviously it never would. So don't count the implied reference of
    being in the hash table.

    2/ There are two paths on which a socket can be destroyed. One called
    svcauth_unix_info_release(). The other didn't. So when the other was
    taken, we can lose a reference to an ip_map which in-turn holds a
    reference to an auth_domain

    So unify the exit paths into svc_sock_put. This highlights the fact
    that svc_delete_socket has slightly odd semantics - it does not drop
    a reference but probably should. Fixing this need a bit more
    thought and testing.

    Signed-off-by: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

21 Oct, 2006

1 commit

  • This patch is suitable for just about any 2.6 kernel. It should go in
    2.6.19 and 2.6.18.2 and possible even the .17 and .16 stable series.

    This is a long standing bug that seems to have only recently become
    apparent, presumably due to increasing use of NFS over TCP - many
    distros seem to be making it the default.

    The SK_CONN bit gets set when a listening socket may be ready
    for an accept, just as SK_DATA is set when data may be available.

    It is entirely possible for svc_tcp_accept to be called with neither
    of these set. It doesn't happen often but there is a small race in
    svc_sock_enqueue as SK_CONN and SK_DATA are tested outside the
    spin_lock. They could be cleared immediately after the test and
    before the lock is gained.

    This normally shouldn't be a problem. The sockets are non-blocking so
    trying to read() or accept() when ther is nothing to do is not a problem.

    However: svc_tcp_recvfrom makes the decision "Should I accept() or
    should I read()" based on whether SK_CONN is set or not. This usually
    works but is not safe. The decision should be based on whether it is
    a TCP_LISTEN socket or a TCP_CONNECTED socket.

    Signed-off-by: Neil Brown
    Cc: Adrian Bunk
    Cc:
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

06 Oct, 2006

1 commit

  • There is some confusion about the meaning of 'bufsz' for a sunrpc server.
    In some cases it is the largest message that can be sent or received. In
    other cases it is the largest 'payload' that can be included in a NFS
    message.

    In either case, it is not possible for both the request and the reply to be
    this large. One of the request or reply may only be one page long, which
    fits nicely with NFS.

    So we remove 'bufsz' and replace it with two numbers: 'max_payload' and
    'max_mesg'. Max_payload is the size that the server requests. It is used
    by the server to check the max size allowed on a particular connection:
    depending on the protocol a lower limit might be used.

    max_mesg is the largest single message that can be sent or received. It is
    calculated as the max_payload, rounded up to a multiple of PAGE_SIZE, and
    with PAGE_SIZE added to overhead. Only one of the request and reply may be
    this size. The other must be at most one page.

    Cc: Greg Banks
    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

04 Oct, 2006

2 commits

  • Speed up high call-rate workloads by caching the struct ip_map for the peer on
    the connected struct svc_sock instead of looking it up in the ip_map cache
    hashtable on every call. This helps workloads using AUTH_SYS authentication
    over TCP.

    Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
    synthetic client threads simulating an rsync (i.e. recursive directory
    listing) workload reading from an i386 RH9 install image (161480 regular files
    in 10841 directories) on the server. That tree is small enough to fill in the
    server's RAM so no disk traffic was involved. This setup gives a sustained
    call rate in excess of 60000 calls/sec before being CPU-bound on the server.

    Profiling showed strcmp(), called from ip_map_match(), was taking 4.8% of each
    CPU, and ip_map_lookup() was taking 2.9%. This patch drops both contribution
    into the profile noise.

    Note that the above result overstates this value of this patch for most
    workloads. The synthetic clients are all using separate IP addresses, so
    there are 64 entries in the ip_map cache hash. Because the kernel measured
    contained the bug fixed in commit

    commit 1f1e030bf75774b6a283518e1534d598e14147d4

    and was running on 64bit little-endian machine, probably all of those 64
    entries were on a single chain, thus increasing the cost of ip_map_lookup().

    With a modern kernel you would need more clients to see the same amount of
    performance improvement. This patch has helped to scale knfsd to handle a
    deployment with 2000 NFS clients.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • .. by allocating the array of 'kvec' in 'struct svc_rqst'.

    As we plan to increase RPCSVC_MAXPAGES from 8 upto 256, we can no longer
    allocate an array of this size on the stack. So we allocate it in 'struct
    svc_rqst'.

    However svc_rqst contains (indirectly) an array of the same type and size
    (actually several, but they are in a union). So rather than waste space, we
    move those arrays out of the separately allocated union and into svc_rqst to
    share with the kvec moved out of svc_tcp_recvfrom (various arrays are used at
    different times, so there is no conflict).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown