02 Oct, 2006

40 commits

  • This adds the new kernel_execve function on all architectures that were using
    _syscall3() to implement execve.

    The implementation uses code from the _syscall3 macros provided in the
    unistd.h header file. I don't have cross-compilers for any of these
    architectures, so the patch is untested with the exception of i386.

    Most architectures can probably implement this in a nicer way in assembly or
    by combining it with the sys_execve implementation itself, but this should do
    it for now.

    [bunk@stusta.de: m68knommu build fix]
    [markh@osdl.org: build fix]
    [bero@arklinux.org: build fix]
    [ralf@linux-mips.org: mips fix]
    [schwidefsky@de.ibm.com: s390 fix]
    Signed-off-by: Arnd Bergmann
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Signed-off-by: Ralf Baechle
    Signed-off-by: Bernhard Rosenkraenzer
    Signed-off-by: Mark Haverkamp
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Some architectures provide an execve function that does not set errno, but
    instead returns the result code directly. Rename these to kernel_execve to
    get the right semantics there. Moreover, there is no reason for any of these
    architectures to still provide __KERNEL_SYSCALLS__ or _syscallN macros, so
    remove these right away.

    [akpm@osdl.org: build fix]
    [bunk@stusta.de: build fix]
    Signed-off-by: Arnd Bergmann
    Cc: Andi Kleen
    Acked-by: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Signed-off-by: Adrian Bunk
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • The use of execve() in the kernel is dubious, since it relies on the
    __KERNEL_SYSCALLS__ mechanism that stores the result in a global errno
    variable. As a first step of getting rid of this, change all users to a
    global kernel_execve function that returns a proper error code.

    This function is a terrible hack, and a later patch removes it again after the
    kernel syscalls are gone.

    Signed-off-by: Arnd Bergmann
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
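
    The difference between the two calling conventions described in the entry
    above can be sketched in plain userspace C. This is only an illustration:
    legacy_errno, legacy_style_exec() and direct_style_exec() are hypothetical
    names, not kernel symbols, and nothing is actually executed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared errno variable. */
int legacy_errno;

/* Old convention: return -1 and stash the reason in a shared global. */
int legacy_style_exec(const char *path)
{
    if (path == NULL) {
        legacy_errno = EFAULT;
        return -1;
    }
    return 0;
}

/* New convention: hand the negative error code straight to the caller. */
int direct_style_exec(const char *path)
{
    if (path == NULL)
        return -EFAULT;
    return 0;
}
```

    With the second form the caller needs no shared state to learn why the
    call failed; the return value alone carries the error.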
     
  • Simplify get_undo_list() by dropping the unnecessary cast, removing the
    size variable, and switching to kzalloc() instead of a kmalloc() followed
    by a memset().

    This cleanup was split out of, and then modified from, Jes Sorensen's Task
    Notifiers patches.

    Signed-off-by: Matt Helsley
    Cc: Jes Sorensen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
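
    A userspace analogue of the cleanup described above, using calloc() in the
    role of kzalloc() and malloc() plus memset() in the role of the old code.
    The struct and function names are hypothetical and only model the pattern.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the structure being allocated. */
struct undo_list_model {
    int refcnt;
    void *proc_list;
};

/* Before: size variable, cast, allocate, then clear by hand. */
struct undo_list_model *alloc_two_step(void)
{
    size_t size = sizeof(struct undo_list_model);
    struct undo_list_model *ul = (struct undo_list_model *)malloc(size);

    if (ul == NULL)
        return NULL;
    memset(ul, 0, size);
    return ul;
}

/* After: a single zeroing allocation, no cast and no size variable. */
struct undo_list_model *alloc_zeroed(void)
{
    return calloc(1, sizeof(struct undo_list_model));
}
```

    Both functions return the same zero-filled object; the second simply has
    fewer places for a mistake.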
     
  • This patch fixes copy_namespaces()'s error path.

    When a new nsproxy (new_ns) is created, pointers to the namespaces (ipc,
    uts) are copied from the old nsproxy. Later, in copy_utsname, copy_ipcs,
    etc., a reference is taken on each of those namespaces. On the error path
    the namespaces that were taken are put, so there is no need to put the new
    nsproxy itself, as that would put the namespaces a second time.

    Found when incorporating namespaces into OpenVZ kernel.

    Signed-off-by: Pavel Emelianov
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel
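
    The reference-counting rule behind the fix above can be modelled in a few
    lines of userspace C. All names here (ns_model, proxy_model, proxy_copy)
    are hypothetical; the sketch only shows that an error path should undo the
    references it actually took and nothing more.

```c
#include <assert.h>

struct ns_model { int refs; };
struct proxy_model { struct ns_model *uts; struct ns_model *ipc; };

void ns_get(struct ns_model *n) { n->refs++; }
void ns_put(struct ns_model *n) { n->refs--; }

/*
 * Copy the pointers first, then take references one by one.
 * fail_ipc simulates an error after the uts reference was taken.
 */
int proxy_copy(struct proxy_model *dst, const struct proxy_model *src,
               int fail_ipc)
{
    *dst = *src;            /* pointers copied, no references yet */

    ns_get(dst->uts);
    if (fail_ipc) {
        ns_put(dst->uts);   /* undo only what was taken */
        /*
         * Also "putting the whole proxy" here would drop uts and ipc
         * again and leave both counts lower than they started.
         */
        return -1;
    }
    ns_get(dst->ipc);
    return 0;
}
```

    After a failed copy both counts are back at their starting value, which is
    the invariant the fix restores.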
     
  • Sysctl tweaks for IPC namespace

    Signed-off-by: Pavel Emelianiov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • IPC namespace support for IPC shm code.

    Signed-off-by: Pavel Emelianiov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • IPC namespace support for IPC sem code.

    Signed-off-by: Pavel Emelianiov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • IPC namespace support for IPC msg code.

    Signed-off-by: Pavel Emelianiov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • This patch adds basic IPC namespace functionality to
    IPC utils:
    - init_ipc_ns
    - copy/clone/unshare/free IPC ns
    - /proc preparations

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • This patch set allows unsharing IPCs and having a private set of IPC objects
    (sem, shm, msg) inside a namespace. Basically, it is another building block
    of container functionality.

    This patch implements core IPC namespace changes:
    - ipc_namespace structure
    - new config option CONFIG_IPC_NS
    - adds CLONE_NEWIPC flag
    - unshare support

    [clg@fr.ibm.com: small fix for unshare of ipc namespace]
    [akpm@osdl.org: build fix]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • The nsproxy was being copied in unshare() when anything was being unshared,
    even if it was something not referenced from nsproxy. In some cases this
    would use far more memory than necessary.

    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge Hallyn
     
  • Implement a CLONE_NEWUTS flag, and use it at clone and sys_unshare.

    [clg@fr.ibm.com: IPC unshare fix]
    [bunk@stusta.de: cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Adrian Bunk
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The system_utsname isn't needed now that kernel/sysctl.c is fixed.
    Nuke it.

    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Sysctl uts patch. This will need to be done another way, but since sysctl
    itself needs to be container aware, 'the right thing' is a separate patchset.

    [akpm@osdl.org: ia64 build fix]
    [sam.vilain@catalyst.net.nz: cleanup]
    [sam.vilain@catalyst.net.nz: add proc_do_utsns_string]
    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch defines the uts namespace and some manipulators.
    Adds the uts namespace to task_struct, and initializes a
    system-wide init namespace.

    It leaves a #define for system_utsname so sysctl will compile.
    This define will be removed in a separate patch.

    [akpm@osdl.org: build fix, cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • In some places, particularly drivers and __init code, the init utsns is the
    appropriate one to use. This patch replaces those with the init_utsname
    helper.

    Changes: Removed several uses of init_utsname(). Hope I picked all the
    right ones in net/ipv4/ipconfig.c. These are now changed to
    utsname() (the per-process namespace utsname) in the previous
    patch (2/7)

    [akpm@osdl.org: CIFS fix]
    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Replace references to system_utsname with the per-process uts namespace
    where appropriate. This includes things like uname.

    Changes: Per Eric Biederman's comments, use the per-process uts namespace
    for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c

    [jdike@addtoit.com: UML fix]
    [clg@fr.ibm.com: cleanup]
    [akpm@osdl.org: build fix]
    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Cedric Le Goater
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Define utsname() and init_utsname() which return &system_utsname. Users of
    system_utsname will be changed to use these helpers, after which
    system_utsname will disappear.

    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • exit_task_namespaces() has replaced the former exit_namespace(). It
    invalidates task->nsproxy and associated namespaces. This is an issue for
    the (future) pid namespace which is required to be valid in exit_notify().

    This patch moves exit_task_namespaces() after exit_notify() to keep nsproxy
    valid.

    Signed-off-by: Cedric Le Goater
    Cc: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • This moves the mount namespace into the nsproxy. The mount namespace count
    now refers to the number of nsproxies pointing to it, rather than the number
    of tasks. As a result, the unshare_namespace() function in kernel/fork.c no
    longer checks whether it is being shared.

    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Move the init_nsproxy definition out of arch/ into kernel/nsproxy.c. This
    avoids all arches having to be updated. Compiles and boots on s390.

    Signed-off-by: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch adds a nsproxy structure to the task struct. Later patches will
    move the fs namespace pointer into this structure, and introduce a new utsname
    namespace into the nsproxy.

    The vserver and openvz functionality, then, would be implemented in large part
    by virtualizing/isolating more and more resources into namespaces, each
    contained in the nsproxy.

    [akpm@osdl.org: build fix]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch makes the needlessly global _proc_do_string() static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • The logic in proc_do_string is worth re-using without passing in a
    ctl_table structure (say, we want to calculate a pointer and pass that in
    instead); pass in the two fields it uses from that structure as explicit
    arguments.

    Signed-off-by: Sam Vilain
    Cc: Serge E. Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sam Vilain
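
    The refactoring described above follows a common pattern: the core routine
    takes the two fields it needs as explicit arguments, and a thin wrapper
    pulls them out of the table structure. A minimal userspace sketch, with
    hypothetical names (table_model, copy_string_field, table_read_string)
    that do not reflect the real sysctl signatures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the table entry; only two fields matter. */
struct table_model {
    void *data;
    int maxlen;
};

/* Core logic: works on any (data, maxlen) pair, no table needed. */
int copy_string_field(void *data, int maxlen, char *out, size_t outlen)
{
    const char *src = data;
    size_t len = 0;

    if (src == NULL || maxlen <= 0 || outlen == 0)
        return -1;
    while (len < (size_t)maxlen && src[len] != '\0')
        len++;
    if (len >= outlen)
        len = outlen - 1;   /* truncate to fit the output buffer */
    memcpy(out, src, len);
    out[len] = '\0';
    return (int)len;
}

/* Wrapper: keeps the old table-based entry point. */
int table_read_string(const struct table_model *t, char *out, size_t outlen)
{
    return copy_string_field(t->data, t->maxlen, out, outlen);
}
```

    A caller that computes its data pointer at run time can now call
    copy_string_field() directly instead of building a table entry.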
     
  • while doing a kernel make modules_install install over an NFS mount.

    =============================================
    [ INFO: possible recursive locking detected ]
    ---------------------------------------------
    nfsd/9550 is trying to acquire lock:
    (&inode->i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f

    but task is already holding lock:
    (&inode->i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f

    other info that might help us debug this:
    2 locks held by nfsd/9550:
    #0: (hash_sem){..--}, at: [] exp_readlock+0xd/0xf [nfsd]
    #1: (&inode->i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f

    stack backtrace:
    [] show_trace_log_lvl+0x58/0x152
    [] show_trace+0xd/0x10
    [] dump_stack+0x19/0x1b
    [] __lock_acquire+0x77a/0x9a3
    [] lock_acquire+0x60/0x80
    [] __mutex_lock_slowpath+0xa7/0x20e
    [] mutex_lock+0x1c/0x1f
    [] vfs_unlink+0x34/0x8a
    [] nfsd_unlink+0x18f/0x1e2 [nfsd]
    [] nfsd3_proc_remove+0x95/0xa2 [nfsd]
    [] nfsd_dispatch+0xc0/0x178 [nfsd]
    [] svc_process+0x3a5/0x5ed
    [] nfsd+0x1a7/0x305 [nfsd]
    [] kernel_thread_helper+0x5/0xb
    DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
    Leftover inexact backtrace:
    [] show_trace+0xd/0x10
    [] dump_stack+0x19/0x1b
    [] __lock_acquire+0x77a/0x9a3
    [] lock_acquire+0x60/0x80
    [] __mutex_lock_slowpath+0xa7/0x20e
    [] mutex_lock+0x1c/0x1f
    [] vfs_unlink+0x34/0x8a
    [] nfsd_unlink+0x18f/0x1e2 [nfsd]
    [] nfsd3_proc_remove+0x95/0xa2 [nfsd]
    [] nfsd_dispatch+0xc0/0x178 [nfsd]
    [] svc_process+0x3a5/0x5ed
    [] nfsd+0x1a7/0x305 [nfsd]
    [] kernel_thread_helper+0x5/0xb

    =============================================
    [ INFO: possible recursive locking detected ]
    ---------------------------------------------
    nfsd/9580 is trying to acquire lock:
    (&inode->i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f

    but task is already holding lock:
    (&inode->i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f

    other info that might help us debug this:
    2 locks held by nfsd/9580:
    #0: (hash_sem){..--}, at: [] exp_readlock+0xd/0xf [nfsd]
    #1: (&inode->i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f

    stack backtrace:
    [] show_trace_log_lvl+0x58/0x152
    [] show_trace+0xd/0x10
    [] dump_stack+0x19/0x1b
    [] __lock_acquire+0x77a/0x9a3
    [] lock_acquire+0x60/0x80
    [] __mutex_lock_slowpath+0xa7/0x20e
    [] mutex_lock+0x1c/0x1f
    [] nfsd_setattr+0x2c8/0x499 [nfsd]
    [] nfsd_create_v3+0x31b/0x4ac [nfsd]
    [] nfsd3_proc_create+0x128/0x138 [nfsd]
    [] nfsd_dispatch+0xc0/0x178 [nfsd]
    [] svc_process+0x3a5/0x5ed
    [] nfsd+0x1a7/0x305 [nfsd]
    [] kernel_thread_helper+0x5/0xb
    DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
    Leftover inexact backtrace:
    [] show_trace+0xd/0x10
    [] dump_stack+0x19/0x1b
    [] __lock_acquire+0x77a/0x9a3
    [] lock_acquire+0x60/0x80
    [] __mutex_lock_slowpath+0xa7/0x20e
    [] mutex_lock+0x1c/0x1f
    [] nfsd_setattr+0x2c8/0x499 [nfsd]
    [] nfsd_create_v3+0x31b/0x4ac [nfsd]
    [] nfsd3_proc_create+0x128/0x138 [nfsd]
    [] nfsd_dispatch+0xc0/0x178 [nfsd]
    [] svc_process+0x3a5/0x5ed
    [] nfsd+0x1a7/0x305 [nfsd]
    [] kernel_thread_helper+0x5/0xb

    Signed-off-by: Peter Zijlstra
    Cc: Neil Brown
    Cc: Ingo Molnar
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Add /proc/fs/nfsd/pool_threads which allows the sysadmin (or a userspace
    daemon) to read and change the number of nfsd threads in each pool. The
    format is a list of space-separated integers, one per pool.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
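
    The file format described above (space-separated integers, one per pool)
    is simple to parse. The following is a userspace sketch of such a parser,
    not the kernel code; parse_pool_threads() is a hypothetical helper and its
    handling of malformed input is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse up to max_pools whitespace-separated non-negative integers.
 * Returns the number of values stored, or -1 on a negative or
 * out-of-range value. Parsing stops at the first non-numeric token.
 */
int parse_pool_threads(const char *line, int *counts, int max_pools)
{
    const char *p = line;
    int n = 0;

    while (n < max_pools) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);    /* skips leading whitespace */
        if (end == p)
            break;                  /* nothing left to convert */
        if (errno != 0 || v < 0 || v > INT_MAX)
            return -1;
        counts[n++] = (int)v;
        p = end;
    }
    return n;
}
```

    For example, the line "4 4 8" yields three per-pool thread counts.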
     
  • Actually implement multiple pools. On NUMA machines, allocate a svc_pool per
    NUMA node; on SMP a svc_pool per CPU; otherwise a single global pool. Enqueue
    sockets on the svc_pool corresponding to the CPU on which the socket bh is run
    (i.e. the NIC interrupt CPU). Threads have their cpu mask set to limit them
    to the CPUs in the svc_pool that owns them.

    This is the patch that allows an Altix to scale NFS traffic linearly
    beyond 4 CPUs and 4 NICs.

    Incorporates changes and feedback from Neil Brown, Trond Myklebust, and
    Christoph Hellwig.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
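
    The pool selection described above (one pool per NUMA node, one per CPU,
    or a single global pool) amounts to a mapping from CPU number to pool
    index. A simplified sketch follows; the enum, the function name, and the
    assumption that each node owns a contiguous block of cpus_per_node CPUs
    are all hypothetical and not taken from the patch.

```c
#include <assert.h>

enum pool_mode { POOL_GLOBAL, POOL_PERCPU, POOL_PERNODE };

/* Map the CPU that handled the socket to the pool that should own it. */
unsigned int pool_for_cpu(enum pool_mode mode, unsigned int cpu,
                          unsigned int cpus_per_node)
{
    switch (mode) {
    case POOL_PERCPU:
        return cpu;                     /* one pool per CPU */
    case POOL_PERNODE:
        /* assumes CPUs are numbered contiguously within a node */
        return cpus_per_node ? cpu / cpus_per_node : 0;
    case POOL_GLOBAL:
    default:
        return 0;                       /* single shared pool */
    }
}
```

    A socket is then queued on the pool returned for the CPU that ran its
    bottom half, and only threads bound to that pool's CPUs service it.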
     
  • Replace the existing list of all nfsd threads with new code using
    svc_create_pooled().

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Currently knfsd keeps its own list of all nfsd threads in nfssvc.c; add a new
    way of managing the list of all threads in a svc_serv. Add
    svc_create_pooled() to allow creation of a svc_serv whose threads are managed
    by the sunrpc code. Add svc_set_num_threads() to manage the number of threads
    in a service, either per-pool or globally across the service.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Add svc_get() for those occasions when we need to temporarily bump up
    svc_serv->sv_nrthreads as a pseudo refcount.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Split out the list of idle threads and pending sockets from svc_serv into a
    new svc_pool structure, and allocate a fixed number (in this patch, 1) of
    pools per svc_serv. The new structure contains a lock which takes over
    several of the duties of svc_serv->sv_lock, which is now relegated to
    protecting only sv_tempsocks, sv_permsocks, and sv_tmpcnt in svc_serv.

    The point is to move the hottest fields out of svc_serv and into svc_pool,
    allowing a following patch to arrange for a svc_pool per NUMA node or per CPU.
    This is a major step towards making the NFS server NUMA-friendly.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • The SK_BUSY bit in svc_sock->sk_flags ensures that we do not attempt to
    enqueue a socket twice. Currently, setting and clearing the bit is protected
    by svc_serv->sv_lock. As I intend to reduce the data that the lock protects
    so it's not held when svc_sock_enqueue() tests and sets SK_BUSY, that test and
    set needs to be atomic.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
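
    The requirement above, that testing and setting the busy bit be a single
    atomic step once the lock no longer covers it, can be illustrated with C11
    atomics in userspace. The kernel uses its own bit operations; sock_model,
    try_enqueue() and mark_idle() are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sock_model {
    atomic_flag busy;   /* plays the role of the busy bit */
    int enqueued;       /* how many times the socket was queued */
};

/*
 * Queue the socket only if nobody else has already done so.
 * The test and the set happen as one indivisible operation, so two
 * racing callers cannot both see "not busy".
 */
bool try_enqueue(struct sock_model *s)
{
    if (atomic_flag_test_and_set(&s->busy))
        return false;   /* already busy, do not enqueue twice */
    s->enqueued++;
    return true;
}

/* Called once the socket has been serviced. */
void mark_idle(struct sock_model *s)
{
    atomic_flag_clear(&s->busy);
}
```

    With a separate test followed by a set, two CPUs could both observe the
    bit clear and both enqueue the socket; the combined operation closes that
    window without taking a lock.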
     
  • Convert the svc_sock->sk_reserved variable from an int protected by
    svc_serv->sv_lock, to an atomic. This reduces (by 1) the number of places we
    need to take the (effectively global) svc_serv->sv_lock.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Protect the svc_sock->sk_deferred list with a new lock svc_sock->sk_defer_lock
    instead of svc_serv->sv_lock. Using the more fine-grained lock reduces the
    number of places we need to take the svc_serv lock.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Convert the svc_sock->sk_inuse counter from an int protected by
    svc_serv->sv_lock, to an atomic. This reduces the number of places we need to
    take the (effectively global) svc_serv->sv_lock.

    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
     
  • Following are 11 patches from Greg Banks which combine to make knfsd more
    NUMA-aware. They reduce accesses to 'global' data structures, and create
    some data structures that can be node-local.

    knfsd threads are bound to a particular node, and the thread to handle a new
    request is chosen from the threads that are attached to the node that
    received the interrupt.

    The distribution of threads across nodes can be controlled by a new file in
    the 'nfsd' filesystem, though the default approach of an even spread is
    probably fine for most sites.

    Some (old) numbers that show the efficacy of these patches: N == number of
    NICs == number of CPUs == number of clients. Number of NUMA nodes == N/2

           Throughput, MiB/s    CPU usage, % (max=N*100)
      N    Before    After      Before    After
    ---    ------    -----      ------    -----
      4       312      435         350      228
      6       500      656         501      418
      8       562      804         690      589

    This patch:

    Move the aging of RPC/TCP connection sockets from the main svc_recv() loop to
    a timer which uses a mark-and-sweep algorithm every 6 minutes. This reduces
    the amount of work that needs to be done in the main RPC loop and the length
    of time we need to hold the (effectively global) svc_serv->sv_lock.

    [akpm@osdl.org: cleanup]
    Signed-off-by: Greg Banks
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Banks
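
    The mark-and-sweep aging mentioned above can be sketched as follows: each
    timer tick closes connections that are still marked old from the previous
    tick and marks the rest, while any activity clears the mark. This is a
    userspace model with hypothetical names (conn_model, conn_touch, sweep),
    not the sunrpc implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct conn_model {
    bool open;
    bool old;   /* set by the sweep, cleared by activity */
};

/* Any traffic on the connection clears the mark. */
void conn_touch(struct conn_model *c)
{
    c->old = false;
}

/* One timer tick. Returns the number of connections closed. */
int sweep(struct conn_model *conns, size_t n)
{
    int closed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!conns[i].open)
            continue;
        if (conns[i].old) {
            conns[i].open = false;  /* idle for a full interval */
            closed++;
        } else {
            conns[i].old = true;    /* mark; judged on the next tick */
        }
    }
    return closed;
}
```

    A connection therefore survives as long as it sees traffic at least once
    between two consecutive ticks, and the receive loop never has to walk the
    list itself.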
     
  • If lockd_up fails - what should we expect? Do we have to later call
    lockd_down?

    Well the nfs client thinks "no", the nfs server thinks "yes". lockd thinks
    "yes".

    The only answer that really makes sense is "no" !!

    So:
    Make lockd_up only increment nlmsvc_users on success.
    Make nfsd handle errors from lockd_up properly.
    Make sure lockd_up(0) never fails when lockd is running
    so that the 'reclaimer' call to lockd_up doesn't need to
    be error checked.

    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
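
    The rule settled on above ("a failed up needs no matching down") can be
    modelled with a simple user counter. The sketch below is a deliberately
    simplified userspace model: users_model, lockd_up_model() and
    lockd_down_model() are hypothetical, and the start_ok flag merely
    simulates whether starting the service would succeed.

```c
#include <assert.h>

int users_model;    /* number of successful "up" calls outstanding */

/* Returns 0 on success, -1 on failure. */
int lockd_up_model(int start_ok)
{
    if (users_model > 0) {
        users_model++;      /* already running: cannot fail */
        return 0;
    }
    if (!start_ok)
        return -1;          /* failure leaves the count untouched */
    users_model++;          /* count only successful starts */
    return 0;
}

void lockd_down_model(void)
{
    if (users_model > 0)
        users_model--;
}
```

    Because the count is bumped only on success, a caller that sees an error
    can simply return without any cleanup call.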
     
  • Thus it is printed for any path that leads to failure (make_socks is called
    from two places).

    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We should be checking the return value of lockd_up when adding a new socket to
    nfsd. So move the lockd_up before the svc_addsock and check the return value.

    The move is because lockd_down is easy, but there is no easy way to remove a
    recently added socket.

    Cc: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown