08 Dec, 2006

2 commits

  • Stick NFS sockets in their own class to avoid some lockdep warnings. NFS
    sockets are never exposed to user-space, and will hence not trigger certain
    code paths that would otherwise pose deadlock scenarios.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Steven Dickson
    Acked-by: Ingo Molnar
    Cc: Trond Myklebust
    Acked-by: Neil Brown
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    [ Fixed patch corruption by quilt, pointed out by Peter Zijlstra ]
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Move process freezing functions from include/linux/sched.h to freezer.h, so
    that modifications to the freezer or the kernel configuration don't require
    recompiling just about everything.

    [akpm@osdl.org: fix ueagle driver]
    Signed-off-by: Nigel Cunningham
    Cc: "Rafael J. Wysocki"
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     

31 Oct, 2006

2 commits

  • - printk should remain dprintk

    - fix coding-style.

    Cc: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • A recent patch fixed a problem which would occur when the refcount on an
    auth_domain reached zero. This problem has not been reported in practice
    despite existing in two major kernel releases because the refcount can
    never reach zero.

    This patch fixes the problems that stop the refcount reaching zero.

    1/ We were adding to the refcount when inserting in the hash table,
    but only removing from the hashtable when the refcount reached zero.
    Obviously it never would. So don't count the implied reference of
    being in the hash table.

    2/ There are two paths on which a socket can be destroyed. One called
    svcauth_unix_info_release(). The other didn't. So when the other was
    taken, we could lose a reference to an ip_map, which in turn holds a
    reference to an auth_domain.

    So unify the exit paths into svc_sock_put. This highlights the fact
    that svc_delete_socket has slightly odd semantics - it does not drop
    a reference but probably should. Fixing this needs a bit more
    thought and testing.

    Signed-off-by: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
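
The first point of the auth_domain refcount commit above can be illustrated with a minimal user-space sketch. Everything here (struct domain, the table, the helper names) is hypothetical and only models the idea that membership in the hash table is an implied reference that is not counted, so the count can actually reach zero and trigger removal. It is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an auth_domain; not the kernel structure. */
struct domain {
    int refcount;   /* references held by users only */
    int hashed;     /* 1 while the object sits in the table */
};

#define TABLE_SIZE 4
static struct domain *table[TABLE_SIZE];

/* Insert without taking a reference: table membership is implied.
 * The caller is assumed to already hold one user reference. */
static void domain_insert(struct domain *d, unsigned slot)
{
    table[slot % TABLE_SIZE] = d;
    d->hashed = 1;
}

/* Look up an entry and take a user reference on it. */
static struct domain *domain_get(unsigned slot)
{
    struct domain *d = table[slot % TABLE_SIZE];

    if (d)
        d->refcount++;
    return d;
}

/* Drop a user reference; unhash when the last one goes away.
 * Returns 1 if the object was removed from the table. */
static int domain_put(struct domain *d, unsigned slot)
{
    assert(d->refcount > 0);
    if (--d->refcount > 0)
        return 0;
    table[slot % TABLE_SIZE] = NULL;
    d->hashed = 0;
    return 1;
}
```

Had domain_insert() also incremented the count, domain_put() could never see zero while the entry was hashed, which is the situation the commit message describes.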
     

21 Oct, 2006

1 commit

    This patch is suitable for just about any 2.6 kernel. It should go in
    2.6.19 and 2.6.18.2 and possibly even the .17 and .16 stable series.

    This is a long standing bug that seems to have only recently become
    apparent, presumably due to increasing use of NFS over TCP - many
    distros seem to be making it the default.

    The SK_CONN bit gets set when a listening socket may be ready
    for an accept, just as SK_DATA is set when data may be available.

    It is entirely possible for svc_tcp_accept to be called with neither
    of these set. It doesn't happen often but there is a small race in
    svc_sock_enqueue as SK_CONN and SK_DATA are tested outside the
    spin_lock. They could be cleared immediately after the test and
    before the lock is gained.

    This normally shouldn't be a problem. The sockets are non-blocking so
    trying to read() or accept() when there is nothing to do is not a problem.

    However: svc_tcp_recvfrom makes the decision "Should I accept() or
    should I read()" based on whether SK_CONN is set or not. This usually
    works but is not safe. The decision should be based on whether it is
    a TCP_LISTEN socket or a TCP_CONNECTED socket.

    Signed-off-by: Neil Brown
    Cc: Adrian Bunk
    Cc:
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
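
The accept-versus-read decision described in the commit above can be modelled in a few lines. This is a hypothetical sketch: the enum and struct names are invented for illustration and only mirror the wording of the commit message. It shows why a racy hint such as SK_CONN is a poor basis for the decision compared with the kind of socket.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; names echo the commit text, not kernel headers. */
enum toy_state { TOY_LISTEN, TOY_CONNECTED };
enum toy_action { DO_ACCEPT, DO_READ };

struct toy_sock {
    enum toy_state state;  /* what kind of socket this is */
    bool conn_flag;        /* hint like SK_CONN; another thread may clear it */
};

/* Unsafe: trusts a hint that can change between the test and its use. */
static enum toy_action choose_by_flag(const struct toy_sock *s)
{
    return s->conn_flag ? DO_ACCEPT : DO_READ;
}

/* Safe: depends only on whether the socket is a listener. */
static enum toy_action choose_by_state(const struct toy_sock *s)
{
    return s->state == TOY_LISTEN ? DO_ACCEPT : DO_READ;
}
```

With a listening socket whose flag has already been cleared by a racing thread, choose_by_flag() would pick DO_READ, while choose_by_state() still picks DO_ACCEPT.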
     

06 Oct, 2006

1 commit

  • There is some confusion about the meaning of 'bufsz' for a sunrpc server.
    In some cases it is the largest message that can be sent or received. In
    other cases it is the largest 'payload' that can be included in a NFS
    message.

    In either case, it is not possible for both the request and the reply to be
    this large. One of the request or reply may only be one page long, which
    fits nicely with NFS.

    So we remove 'bufsz' and replace it with two numbers: 'max_payload' and
    'max_mesg'. Max_payload is the size that the server requests. It is used
    by the server to check the max size allowed on a particular connection:
    depending on the protocol a lower limit might be used.

    max_mesg is the largest single message that can be sent or received. It is
    calculated as the max_payload, rounded up to a multiple of PAGE_SIZE, and
    with PAGE_SIZE added to overhead. Only one of the request and reply may be
    this size. The other must be at most one page.

    Cc: Greg Banks
    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
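
The max_mesg calculation described in the commit above (max_payload rounded up to a multiple of PAGE_SIZE, plus one page for overhead) can be written out as a small sketch. The 4096-byte page size and the helper names are assumptions made for illustration; the kernel code itself is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size, for illustration only */

/* Round n up to the next multiple of the page size. */
static size_t round_up_to_page(size_t n)
{
    return (n + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}

/* Largest single message: rounded-up payload plus one page of overhead. */
static size_t max_mesg_for(size_t max_payload)
{
    return round_up_to_page(max_payload) + SKETCH_PAGE_SIZE;
}
```

Under these assumptions a 32768-byte payload gives a 36864-byte max_mesg, and a 1000-byte payload gives 8192 bytes.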
     

04 Oct, 2006

5 commits

  • Speed up high call-rate workloads by caching the struct ip_map for the peer on
    the connected struct svc_sock instead of looking it up in the ip_map cache
    hashtable on every call. This helps workloads using AUTH_SYS authentication
    over TCP.

    Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
    synthetic client threads simulating an rsync (i.e. recursive directory
    listing) workload reading from an i386 RH9 install image (161480 regular files
    in 10841 directories) on the server. That tree is small enough to fit in the
    server's RAM so no disk traffic was involved. This setup gives a sustained
    call rate in excess of 60000 calls/sec before being CPU-bound on the server.

    Profiling showed strcmp(), called from ip_map_match(), was taking 4.8% of each
    CPU, and ip_map_lookup() was taking 2.9%. This patch drops both contributions
    into the profile noise.

    Note that the above result overstates the value of this patch for most
    workloads. The synthetic clients are all using separate IP addresses, so
    there are 64 entries in the ip_map cache hash. Because the kernel measured
    contained the bug fixed in commit

    commit 1f1e030bf75774b6a283518e1534d598e14147d4

    and was running on a 64-bit little-endian machine, probably all of those 64
    entries were on a single chain, thus increasing the cost of ip_map_lookup().

    With a modern kernel you would need more clients to see the same amount of
    performance improvement. This patch has helped to scale knfsd to handle a
    deployment with 2000 NFS clients.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
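
The caching idea in the commit above, keeping the looked-up peer entry on the connection instead of repeating a hash lookup on every call, might look roughly like this in user space. All names here (peer_info, conn, slow_lookup) are hypothetical stand-ins, and the linear search merely represents "an expensive lookup involving strcmp()".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cache entry such as an ip_map. */
struct peer_info {
    const char *addr;
    int domain_id;
};

static struct peer_info peer_table[] = {
    { "10.0.0.1", 1 },
    { "10.0.0.2", 2 },
};
static int slow_lookups;   /* counts how often the expensive path runs */

/* Expensive path: string comparisons against every entry. */
static struct peer_info *slow_lookup(const char *addr)
{
    slow_lookups++;
    for (size_t i = 0; i < sizeof(peer_table) / sizeof(peer_table[0]); i++)
        if (strcmp(peer_table[i].addr, addr) == 0)
            return &peer_table[i];
    return NULL;
}

/* A connected socket remembers the entry for its (fixed) peer. */
struct conn {
    const char *peer_addr;
    struct peer_info *cached;
};

static struct peer_info *conn_peer_info(struct conn *c)
{
    if (!c->cached)
        c->cached = slow_lookup(c->peer_addr);
    return c->cached;
}
```

This only pays off on connected transports, where the peer cannot change between calls, which matches the commit's note that the gain is for TCP.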
     
  • .. by allocating the array of 'kvec' in 'struct svc_rqst'.

    As we plan to increase RPCSVC_MAXPAGES from 8 up to 256, we can no longer
    allocate an array of this size on the stack. So we allocate it in 'struct
    svc_rqst'.

    However svc_rqst contains (indirectly) an array of the same type and size
    (actually several, but they are in a union). So rather than waste space, we
    move those arrays out of the separately allocated union and into svc_rqst to
    share with the kvec moved out of svc_tcp_recvfrom (various arrays are used at
    different times, so there is no conflict).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We are planning to increase RPCSVC_MAXPAGES from about 8 to about 256. This
    means we need to be a bit careful about arrays of size RPCSVC_MAXPAGES.

    struct svc_rqst contains two such arrays. However, there are never more
    than RPCSVC_MAXPAGES pages in the two arrays together, so only one array is
    needed.

    The two arrays are for the pages holding the request, and the pages holding
    the reply. Instead of two arrays, we can simply keep an index into where the
    first reply page is.

    This patch also removes a number of small inline functions that probably
    serve to obscure what is going on rather than clarify it, and open-codes the
    needed functionality.

    Also remove the 'rq_restailpage' variable as it is *always* 0. i.e. if the
    response 'xdr' structure has a non-empty tail it is always in the same pages
    as the head.

    check counters are initialised and incremented properly
    check for consistent usage of ++ etc
    maybe extract some inlines for common approach
    general review

    Signed-off-by: Neil Brown
    Cc: Magnus Maatta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
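
The single-array layout described in the commit above, one page array plus an index marking where the reply pages start, can be sketched as follows. The struct and field names are hypothetical and the array is deliberately tiny. The point is only that request and reply pages share one array bounded by one limit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXPAGES 8   /* stands in for RPCSVC_MAXPAGES */

struct toy_rqst {
    void *pages[TOY_MAXPAGES];
    size_t first_reply;  /* pages[0 .. first_reply) hold the request */
    size_t used;         /* pages[first_reply .. used) hold the reply */
};

/* Record how many leading pages the request occupies. */
static void set_request_pages(struct toy_rqst *r, size_t n)
{
    assert(n <= TOY_MAXPAGES);
    r->first_reply = n;
    r->used = n;
}

/* Append a reply page; returns its slot index, or -1 if the array is full. */
static int add_reply_page(struct toy_rqst *r, void *page)
{
    if (r->used >= TOY_MAXPAGES)
        return -1;
    r->pages[r->used] = page;
    return (int)r->used++;
}

static size_t reply_page_count(const struct toy_rqst *r)
{
    return r->used - r->first_reply;
}
```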
     
  • Arrgg.. We cannot 'lockd_up' before 'svc_addsock' as we don't know the
    protocol yet.... So switch it around again and save the name of the created
    socket so that it can be closed if lockd_up fails.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The refcount that nfsd holds on lockd is based on the number of open sockets.
    So when we close a socket, we should decrement the ref (with lockd_down).

    Currently when a socket is closed via writing to the portlist file, that
    doesn't happen.

    So: make sure we get an error return if the socket that was requested is
    not found, and call lockd_down if it was.

    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

02 Oct, 2006

10 commits

  • Actually implement multiple pools. On NUMA machines, allocate a svc_pool per
    NUMA node; on SMP a svc_pool per CPU; otherwise a single global pool. Enqueue
    sockets on the svc_pool corresponding to the CPU on which the socket bh is run
    (i.e. the NIC interrupt CPU). Threads have their cpu mask set to limit them
    to the CPUs in the svc_pool that owns them.

    This is the patch that allows an Altix to scale NFS traffic linearly
    beyond 4 CPUs and 4 NICs.

    Incorporates changes and feedback from Neil Brown, Trond Myklebust, and
    Christoph Hellwig.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
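
The pool selection described in the commit above (one pool per NUMA node, one per CPU, or a single global pool) amounts to a mapping from the CPU that handled the socket's bottom half to a pool index. The sketch below is a hypothetical model of that mapping. The mode names, the fixed two-CPUs-per-node topology and the CPU count are assumptions, not taken from the patch.

```c
#include <assert.h>

/* Hypothetical pool modes matching the three cases in the commit text. */
enum pool_mode { POOL_GLOBAL, POOL_PERCPU, POOL_PERNODE };

#define TOY_NR_CPUS       8   /* assumed machine size */
#define TOY_CPUS_PER_NODE 2   /* assumed topology */

/* Map the CPU that ran the socket's bottom half to a pool index. */
static unsigned pool_for_cpu(enum pool_mode mode, unsigned cpu)
{
    assert(cpu < TOY_NR_CPUS);
    switch (mode) {
    case POOL_PERCPU:
        return cpu;
    case POOL_PERNODE:
        return cpu / TOY_CPUS_PER_NODE;
    case POOL_GLOBAL:
    default:
        return 0;
    }
}
```

Threads belonging to a pool would then be restricted to that pool's CPUs, so a request is normally served close to where its interrupt arrived.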
     
  • Split out the list of idle threads and pending sockets from svc_serv into a
    new svc_pool structure, and allocate a fixed number (in this patch, 1) of
    pools per svc_serv. The new structure contains a lock which takes over
    several of the duties of svc_serv->sv_lock, which is now relegated to
    protecting only sv_tempsocks, sv_permsocks, and sv_tmpcnt in svc_serv.

    The point is to move the hottest fields out of svc_serv and into svc_pool,
    allowing a following patch to arrange for a svc_pool per NUMA node or per CPU.
    This is a major step towards making the NFS server NUMA-friendly.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • The SK_BUSY bit in svc_sock->sk_flags ensures that we do not attempt to
    enqueue a socket twice. Currently, setting and clearing the bit is protected
    by svc_serv->sv_lock. As I intend to reduce the data that the lock protects
    so it's not held when svc_sock_enqueue() tests and sets SK_BUSY, that test and
    set needs to be atomic.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
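
An atomic test-and-set of a busy bit, as the SK_BUSY commit above calls for, can be sketched with C11 atomics in user space. The kernel has its own bit operations, which are not reproduced here. The struct, the bit value and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_BUSY 0x1u   /* hypothetical bit standing in for SK_BUSY */

struct toy_flags {
    atomic_uint flags;
};

/* Atomically set the busy bit; returns true only for the caller
 * that found it clear, i.e. the one allowed to enqueue the socket. */
static bool try_mark_busy(struct toy_flags *s)
{
    unsigned old = atomic_fetch_or(&s->flags, TOY_BUSY);

    return (old & TOY_BUSY) == 0;
}

/* Clear the busy bit once the socket has been handled. */
static void clear_busy(struct toy_flags *s)
{
    atomic_fetch_and(&s->flags, ~TOY_BUSY);
}
```

Because the read and the write happen in one atomic step, two racing callers cannot both see the bit clear, so no outer lock is needed for this particular check.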
     
  • Convert the svc_sock->sk_reserved variable from an int protected by
    svc_serv->sv_lock, to an atomic. This reduces (by 1) the number of places we
    need to take the (effectively global) svc_serv->sv_lock.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Protect the svc_sock->sk_deferred list with a new lock svc_sock->sk_defer_lock
    instead of svc_serv->sv_lock. Using the more fine-grained lock reduces the
    number of places we need to take the svc_serv lock.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Convert the svc_sock->sk_inuse counter from an int protected by
    svc_serv->sv_lock, to an atomic. This reduces the number of places we need to
    take the (effectively global) svc_serv->sv_lock.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Following are 11 patches from Greg Banks which combine to make knfsd more
    NUMA-aware. They reduce hitting on 'global' data structures, and create some
    data-structures that can be node-local.

    knfsd threads are bound to a particular node, and the thread to handle a new
    request is chosen from the threads that are attached to the node that received
    the interrupt.

    The distribution of threads across nodes can be controlled by a new file in
    the 'nfsd' filesystem, though the default approach of an even spread is
    probably fine for most sites.

    Some (old) numbers that show the efficacy of these patches: N == number of
    NICs == number of CPUs == number of clients. Number of NUMA nodes == N/2

    N    Throughput, MiB/s    CPU usage, % (max=N*100)
         Before    After      Before    After
    ---  ------    -----      ------    -----
    4    312       435        350       228
    6    500       656        501       418
    8    562       804        690       589

    This patch:

    Move the aging of RPC/TCP connection sockets from the main svc_recv() loop to
    a timer which uses a mark-and-sweep algorithm every 6 minutes. This reduces
    the amount of work that needs to be done in the main RPC loop and the length
    of time we need to hold the (effectively global) svc_serv->sv_lock.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
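
The mark-and-sweep aging mentioned in the commit above can be modelled as follows. This is a hypothetical sketch, not the kernel timer code: each sweep closes connections still carrying the mark from the previous sweep and marks the rest, while any activity clears the mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state for the aging sketch. */
struct aged_conn {
    bool open;
    bool old;    /* set by a sweep, cleared by activity */
};

/* Called whenever the connection carries traffic. */
static void conn_activity(struct aged_conn *c)
{
    c->old = false;
}

/* One timer tick: close connections still marked from the previous
 * tick, mark the others. Returns the number of connections closed. */
static size_t sweep(struct aged_conn *conns, size_t n)
{
    size_t closed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!conns[i].open)
            continue;
        if (conns[i].old) {
            conns[i].open = false;
            closed++;
        } else {
            conns[i].old = true;
        }
    }
    return closed;
}
```

In this model an idle connection survives at least one full interval and at most two, and the main request loop never has to walk the connection list itself.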
     
  • It isn't needed as it is available in rqstp->rq_server, and dropping it allows
    some local vars to be dropped.

    [akpm@osdl.org: build fix]
    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Userspace should create and bind a socket (but not connect it) and write the
    'fd' to portlist. This will cause the nfs server to listen on that socket.

    To close a socket, the name of the socket, as read from 'portlist', can be
    written to 'portlist' with a preceding '-'.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This file will list all ports that nfsd has open.
    Default when TCP enabled will be
    ipv4 udp 0.0.0.0 2049
    ipv4 tcp 0.0.0.0 2049

    Later, the list of ports will be settable.

    'portlist' chosen rather than 'ports', to avoid unnecessary confusion with
    non-mainline patches which created 'ports' with different semantics.

    [akpm@osdl.org: cleanups, build fix]
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

29 Sep, 2006

2 commits


23 Sep, 2006

1 commit


22 Jul, 2006

1 commit


21 Mar, 2006

1 commit

  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ingo Molnar
     

19 Jan, 2006

1 commit


07 Jan, 2006

1 commit

  • I submitted this one previously - svc_tcp_recvfrom currently returns
    any errors to the caller, including ECONNRESET and the like.

    This is something svc_recv isn't able to deal with:

        len = svsk->sk_recvfrom(rqstp);
        [...]
        if (len == 0 || len == -EAGAIN) {
                [...]
                return -EAGAIN;
        }

        [...]
        return len;

    The nfsd main loop will exit when it sees an error code other than
    EAGAIN.

    The following patch fixes this problem

    svc_recv is not equipped to deal with error codes other than EAGAIN,
    and will propagate anything else (such as ECONNRESET) up to nfsd,
    causing it to exit.

    Signed-off-by: Olaf Kirch
    Cc: Trond Myklebust
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olaf Kirch
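
The patch itself is not shown in this log, so the following is only one plausible shape for the behaviour the commit above asks for: a transport-level error such as ECONNRESET closes the affected connection but is reported upward as EAGAIN, so the caller's main loop keeps running. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical connection state for the sketch. */
struct toy_conn {
    bool closed;
};

/* raw is the transport result: bytes received (> 0), 0, or a negative
 * errno. Only a positive length or -EAGAIN is ever returned, so a
 * caller that exits on any other error never sees one. */
static int filter_recv_result(struct toy_conn *c, int raw)
{
    if (raw > 0)
        return raw;
    if (raw != 0 && raw != -EAGAIN)
        c->closed = true;   /* e.g. -ECONNRESET: drop this connection only */
    return -EAGAIN;
}
```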
     

04 Jan, 2006

1 commit

  • I noticed that some of 'struct proto_ops' used in the kernel may share
    a cache line used by locks or other heavily modified data. (default
    linker alignment is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
    least)

    This patch makes sure a 'struct proto_ops' can be declared as const,
    so that all cpus can share all parts of it without false sharing.

    This is not mandatory : a driver can still use a read/write structure
    if it needs to (and eventually a __read_mostly)

    I made a global substitute to change all existing occurrences to make
    them const.

    This should reduce the possibility of false sharing on SMP, and
    speedup some socket system calls.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Nov, 2005

1 commit

  • Being kernel-threads, nfsd servers don't get pre-empted (depending on
    CONFIG). If there is a steady stream of NFS requests that can be served
    from cache, an nfsd thread may hold on to a cpu indefinitely, which isn't
    very friendly.

    So it is good to have a cond_resched in there (just before looking for a
    new request to serve), to make sure we play nice.

    Signed-off-by: Neil Brown
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

11 Nov, 2005

1 commit

  • Here is the patch that introduces the generic skb_checksum_complete
    which also checks for hardware RX checksum faults. If that happens,
    it'll call netdev_rx_csum_fault which currently prints out a stack
    trace with the device name. In future it can turn off RX checksum.

    I've converted every spot under net/ that does RX checksum checks to
    use skb_checksum_complete or __skb_checksum_complete with the
    exceptions of:

    * Those places where checksums are done bit by bit. These will call
    netdev_rx_csum_fault directly.

    * The following have not been completely checked/converted:

    ipmr
    ip_vs
    netfilter
    dccp

    This patch is based on patches and suggestions from Stephen Hemminger
    and David S. Miller.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

28 Oct, 2005

1 commit


27 Oct, 2005

1 commit


24 Sep, 2005

1 commit

  • Clean-up: Move some code that is common to both RPC client- and server-side
    socket transports into its own source file, net/sunrpc/socklib.c.

    Test-plan:
    Compile kernel with CONFIG_NFS enabled. Millions of fsx operations over
    UDP, client and server. Connectathon over UDP.

    Version: Thu, 11 Aug 2005 16:03:09 -0400

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

13 Sep, 2005

2 commits

  • Change a printk(KERN_WARNING to dprintk, as it is really only interesting
    when trying to debug a problem, and can occur normally without error.

    Remove various gratuitous gotos in surrounding code, and remove some
    type-cast assignments from inside 'if' conditionals, as that is just
    obscuring what is going on.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     
  • Use schedule_timeout_{,un}interruptible() instead of
    set_current_state()/schedule_timeout() to reduce kernel size. Also use
    human-time conversion functions instead of hard-coded division to avoid
    rounding issues.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Nishanth Aravamudan
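
The rounding issue mentioned in the commit above can be shown with plain arithmetic. The two helpers below are hypothetical and are not the kernel's conversion functions. They contrast a hard-coded truncating division with a conversion that rounds up, so that a requested delay is never shortened to fewer ticks than it needs.

```c
#include <assert.h>

/* Hard-coded style: ms * HZ / 1000 truncates, so short delays can
 * collapse to zero ticks. */
static unsigned long ticks_truncating(unsigned long ms, unsigned long hz)
{
    return ms * hz / 1000;
}

/* Rounding up guarantees at least the requested delay. */
static unsigned long ticks_rounding_up(unsigned long ms, unsigned long hz)
{
    return (ms * hz + 999) / 1000;
}
```

With 100 ticks per second, a 5 ms request truncates to 0 ticks but rounds up to 1, and a 15 ms request truncates to 1 tick but rounds up to 2.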
     

30 Aug, 2005

3 commits


10 Aug, 2005

1 commit