01 Jul, 2006

11 commits

  • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    Remove obsolete #include
    remove obsolete swsusp_encrypt
    arch/arm26/Kconfig typos
    Documentation/IPMI typos
    Kconfig: Typos in net/sched/Kconfig
    v9fs: do not include linux/version.h
    Documentation/DocBook/mtdnand.tmpl: typo fixes
    typo fixes: specfic -> specific
    typo fixes in Documentation/networking/pktgen.txt
    typo fixes: occuring -> occurring
    typo fixes: infomation -> information
    typo fixes: disadvantadge -> disadvantage
    typo fixes: aquire -> acquire
    typo fixes: mecanism -> mechanism
    typo fixes: bandwith -> bandwidth
    fix a typo in the RTC_CLASS help text
    smb is no longer maintained

    Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S

    Linus Torvalds
     
  • Conversion of nr_bounce to a per zone counter

    nr_bounce is only used for proc output, so it could be left as an event
    counter. However, the event counters may not be accurate, and nr_bounce
    categorizes types of pages in a zone, so we really need this to be a per
    zone counter as well.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM and therefore we need to
    remove several functions that were depending on GET_PAGE_STATE_LAST in order
    to make the kernel compile again. We are only left with event type counters
    in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_writeback to a per zone counter.

    This removes the last page_state counter from arch/i386/mm/pgtable.c so we
    drop the page_state from there.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This makes nr_dirty a per zone counter. Looping over all processors is
    avoided during writeback state determination.

    The counter aggregation for nr_dirty had to be undone in the NFS layer since
    we summed up the page counts from multiple zones. Someone more familiar with
    NFS should probably review what I have done.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Conversion of nr_page_table_pages to a per zone counter

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • - Allows reclaim to access the counter without looping over processor counts.

    - Allows accurate statistics on how many pages are used in a zone by
    the slab. This may become useful to balance slab allocations over
    various zones.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
    calculation as the number of mapped pagecache pages. However, that is not
    true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
    separates those and therefore allows an accurate tracking of the anonymous
    pages per zone.

    It then becomes possible to determine the number of unmapped pages per zone
    and we can avoid scanning for unmapped pages if there are none.

    Also it may now be possible to determine the mapped/unmapped ratio in
    get_dirty_limit. Isn't the number of anonymous pages irrelevant in that
    calculation?

    Note that this will change the meaning of the number of mapped pages reported
    in /proc/vmstat, /proc/meminfo and in the per node statistics. This may affect
    user space tools that monitor these counters! NR_FILE_MAPPED works like
    NR_FILE_DIRTY. It is only valid for pagecache pages.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
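
    With anonymous pages split out of NR_FILE_MAPPED, the per-zone unmapped
    pagecache count the entry above talks about becomes a simple subtraction.
    A minimal sketch, assuming the zone_page_state() accessor and the
    zone_stat_item names introduced by this counter series; the helper itself
    is illustrative, not part of the patch:

        #include <linux/mmzone.h>
        #include <linux/vmstat.h>

        /* Roughly how many pagecache pages in this zone are not mapped into
         * any process?  Only meaningful once mapped anonymous pages are no
         * longer counted in NR_FILE_MAPPED. */
        static unsigned long zone_unmapped_file_pages(struct zone *zone)
        {
                unsigned long file   = zone_page_state(zone, NR_FILE_PAGES);
                unsigned long mapped = zone_page_state(zone, NR_FILE_MAPPED);

                /* The counters are approximate, so guard against transient
                 * underflow. */
                return file > mapped ? file - mapped : 0;
        }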
     
  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
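
    The last paragraph above is the interesting API detail: because page cache
    counter updates always run with interrupts off, the cheaper non-atomic __
    variant of the zoned counter operations can be used. A hedged sketch of
    that pattern, using the update functions introduced by this series; the
    surrounding locking context exists only in the comments:

        #include <linux/mm.h>
        #include <linux/vmstat.h>

        /* Caller holds the mapping's tree_lock with interrupts disabled, so
         * the non-atomic __ variant of the counter update is safe. */
        static void account_page_added_to_cache(struct page *page)
        {
                __inc_zone_page_state(page, NR_FILE_PAGES);
        }

        /* From a context that may be interrupted, the irq-safe form
         * inc_zone_page_state(page, NR_FILE_PAGES) would be used instead. */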
     
  • nr_mapped is important because it allows a determination of how many pages of
    a zone are not mapped, which would allow a more efficient means of determining
    when we need to reclaim memory in a zone.

    We take the nr_mapped field out of the page state structure and define a new
    per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
    from NR_MAPPED in the next patch).

    We replace the use of nr_mapped in various kernel locations. This avoids the
    looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
    zone reclaim).

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Jörn Engel
    Signed-off-by: Adrian Bunk

    Jörn Engel
     

28 Jun, 2006

1 commit

  • Move the i386 VDSO down into a vma and thus randomize it.

    Besides the security implications, this feature also helps debuggers, which
    can COW a vma-backed VDSO just like a normal DSO and can thus do
    single-stepping and other debugging features.

    It's good for hypervisors (Xen, VMWare) too, which typically live in the same
    high-mapped address space as the VDSO, hence whenever the VDSO is used, they
    get lots of guest pagefaults and have to fix such guest accesses up - which
    slows things down instead of speeding things up (the primary purpose of the
    VDSO).

    There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
    for older glibcs that still rely on a prelinked high-mapped VDSO. Newer
    distributions (using glibc 2.3.3 or later) can turn this option off. Turning
    it off is also recommended for security reasons: attackers cannot use the
    predictable high-mapped VDSO page as syscall trampoline anymore.

    There is a new vdso=[0|1] boot option as well, and a runtime
    /proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
    on/off.

    (This version of the VDSO-randomization patch also has working ELF
    coredumping, the previous patch crashed in the coredumping code.)

    This code is a combined work of the exec-shield VDSO randomization
    code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
    started this patch and I completed it.

    [akpm@osdl.org: cleanups]
    [akpm@osdl.org: compile fix]
    [akpm@osdl.org: compile fix 2]
    [akpm@osdl.org: compile fix 3]
    [akpm@osdl.org: revert MAXMEM change]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Cc: Gerd Hoffmann
    Cc: Rusty Russell
    Cc: Zachary Amsden
    Cc: Andi Kleen
    Cc: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
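
    Because the vDSO no longer sits at a fixed high address, user space has to
    locate it through the ELF auxiliary vector instead of assuming a constant
    mapping. A small user-space sketch; glibc's getauxval() is used here for
    brevity and postdates this commit, so treat it as illustrative:

        #include <stdio.h>
        #include <sys/auxv.h>   /* getauxval, AT_SYSINFO_EHDR */

        int main(void)
        {
                /* Base address of the vDSO ELF image mapped into this
                 * process; with the randomization described above it
                 * changes from exec to exec. */
                unsigned long vdso = getauxval(AT_SYSINFO_EHDR);

                if (vdso)
                        printf("vDSO mapped at 0x%lx\n", vdso);
                else
                        printf("no vDSO reported in the auxiliary vector\n");
                return 0;
        }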
     

27 Jun, 2006

27 commits

  • Below is a patch to add a new /proc/self/attr/sockcreate. A process may write
    a context into this interface and all subsequent sockets created will be
    labeled with that context. This is the same idea as the fscreate interface,
    where a process can specify the label of a file about to be created. At this
    time one envisioned user of this will be xinetd. It will be able to better
    label sockets for the actual services. At this time all sockets take the
    label of the creating process, so all xinetd sockets would just be labeled
    the same.

    I tested this by creating a tcp sender and listener. The sender was able to
    write to this new proc file and then create sockets with the specified label.
    I am able to be sure the new label was used since the avc denial messages
    kicked out by the kernel included both the new security permission
    setsockcreate and all the socket denials were for the new label, not the label
    of the running process.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
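
    The usage pattern described above boils down to: write a SELinux context
    to /proc/self/attr/sockcreate, then create the socket. A hedged user-space
    sketch of that sequence; the context string is a made-up placeholder and
    error handling is trimmed:

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/socket.h>

        int main(void)
        {
                /* Placeholder label; a real caller would use a context
                 * appropriate to the service being started. */
                const char *ctx = "system_u:object_r:inetd_child_t";
                int fd = open("/proc/self/attr/sockcreate", O_WRONLY);

                if (fd >= 0) {
                        /* Sockets created after this write carry ctx,
                         * subject to the setsockcreate permission check. */
                        write(fd, ctx, strlen(ctx));
                        close(fd);
                }

                int s = socket(AF_INET, SOCK_STREAM, 0);
                if (s >= 0)
                        close(s);
                return 0;
        }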
     
  • Try to make next_tid() a bit more readable and delete the unnecessary
    "pid_alive(pos)" check.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • first_tid:

        /* If nr exceeds the number of threads there is nothing todo */
        if (nr) {
                if (nr >= get_nr_threads(leader))
                        goto done;
        }

    This is not reliable: sub-threads can exit after this check, so the 'for'
    loop below can overlap and proc_task_readdir() can return already
    filldir'ed dirents.

        for (; pos && pid_alive(pos); pos = next_thread(pos)) {
                if (--nr > 0)
                        continue;

    Off-by-one error: this will return 'leader' when nr == 1.

    This patch tries to fix these problems and simplify the code.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
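
    The off-by-one noted above is easy to reproduce in isolation: skipping nr
    entries with "if (--nr > 0) continue;" only skips nr - 1 of them, so for
    nr == 1 the loop hands back the first element (the leader) instead of the
    second. A tiny user-space illustration:

        #include <stdio.h>

        int main(void)
        {
                int tids[] = { 100, 101, 102, 103 };  /* tids[0] is the "leader" */
                int nr = 1;                           /* caller asked to skip 1 entry */

                for (int i = 0; i < 4; i++) {
                        if (--nr > 0)
                                continue;             /* skips only nr - 1 entries */
                        printf("returned tid %d\n", tids[i]);
                        break;
                }
                /* Prints 100, the leader, even though one entry should have
                 * been skipped. */
                return 0;
        }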
     
  • This is just like my previous removal of tasklist_lock from first_tgid, and
    next_tgid. It simply had to wait until it was rcu safe to walk the thread
    list.

    This should be the last instance of the tasklist_lock in proc. So user
    processes should not be able to influence the tasklist lock hold times.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • In the process of getting proc_fd_access_allowed to work it has developed a
    few warts: in particular, the special case that always allows introspection
    and the special case that allows inspection of kernel threads.

    The special case for introspection is needed for /proc/self/mem.

    The special case for kernel threads really should be overridable
    by security modules.

    So consolidate these checks into ptrace.c:may_attach().

    The check to always allow introspection is trivial.

    The check to allow access to kernel threads and zombies is a little
    trickier. mem_read and mem_write already verify an mm exists, so that check
    isn't needed twice. proc_fd_access_allowed doesn't want a check that
    verifies task->mm exists at all, as it would prevent all access to kernel
    threads. So just move the task->mm check into ptrace_attach, where it is
    needed for practical reasons.

    I did a quick audit and none of the security modules in the kernel seem to
    care if they are passed a task without an mm into security_ptrace. So the
    above move should be safe and it allows security modules to come up with
    more restrictive policy.

    Signed-off-by: Eric W. Biederman
    Cc: Stephen Smalley
    Cc: Chris Wright
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Since 2.2 we have been doing a chroot check to see if it is appropriate to
    return a read or follow one of these magic symlinks. The chroot check was
    asking a question about the visibility of files to the calling process, but
    it was actually checking the destination process, not the files
    themselves. That test was clearly bogus.

    In my first pass through I simply fixed the test to check the visibility of
    the files themselves. That naive approach to fixing the permissions was
    too strict and resulted in cases where a task could not even see all of
    its file descriptors.

    What has disturbed me about relaxing this check is that file descriptors
    are per-process private things, and they are occasionally used as user space
    capability tokens. Looking a little farther into the symlink path on /proc
    I did find userid checks and a check for a capability (CAP_DAC_OVERRIDE),
    so there were permission checks covering this.

    But I was still concerned about privacy. Besides /proc there is only one
    other way to find out this kind of information, and that is ptrace. ptrace
    has been around for a long time and it has a well established security
    model.

    So after thinking about it I finally realized that the permission checks
    that make sense are the permission checks applied to ptrace_attach. The
    checks are simple, per-process, and won't cause nasty surprises for people
    coming from less capable unices.

    Unfortunately there is one case that the current ptrace_attach test does
    not cover: Zombies and kernel threads. Single stepping those kinds of
    processes is impossible. Being able to see which file descriptors are open
    on these tasks is important to lsof, fuser and friends. So for these
    special processes I made the rule you can't find out unless you have
    CAP_SYS_PTRACE.

    These proc permission checks should now conform to the principle of least
    surprise. As well as using much less code to implement :)

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The code doesn't need to sleep when making this check, so I can just do the
    comparison and not worry about the reference counts.

    TODO: While looking at this I realized that my original cleanup did not push
    the permission check far enough down into the stack. The call of
    proc_check_dentry_visible needs to move out of the generic proc
    readlink/follow link code and into the individual get_link instances.
    Otherwise the shared resources checks are not quite correct (shared
    files_struct does not require a shared fs_struct), and there are races with
    unshare.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
    that they work with struct pid instead of struct task_ref.

    Mostly this is a straight 1-1 substitution.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Every inode in /proc holds a reference to a struct task_struct. If a
    directory or file is opened and remains open after the task exits, this
    pinning continues. With 8K stacks on a 32bit machine the amount pinned per
    file descriptor is about 10K.

    Normally I would figure a reasonable per user process limit is about 100
    processes. With 80 processes, each with 1000 file descriptors, I can trigger
    the OOM killer on a 32bit kernel, because I have pinned about 800MB of useless
    data.

    This patch replaces the struct task_struct pointer with a pointer to a struct
    task_ref which has a struct task_struct pointer, so the pinning of dead
    tasks does not happen.

    The code now has to contend with the fact that the task may now exit at any
    time, which is a little, but not much, more complicated.

    With this change it takes about 1000 processes each opening up 1000 file
    descriptors before I can trigger the OOM killer. Much better.

    [mlp@google.com: task_mmu small fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Paul Jackson
    Cc: Oleg Nesterov
    Cc: Albert Cahalan
    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Currently in /proc at several different places we define buffers to hold a
    process id or a file descriptor. In most of them we use either a hard coded
    number or a different define. Modify them all to use PROC_NUMBUF, so the code
    has a chance of being maintained.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
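
    The convention the entry above settles on is a single buffer-size define
    plus snprintf() instead of hand-rolled conversions into ad-hoc buffers. A
    minimal user-space sketch of the idea; the PROC_NUMBUF value used here is
    only illustrative:

        #include <stdio.h>

        #define PROC_NUMBUF 13   /* room for a decimal pid/fd plus a NUL */

        int main(void)
        {
                char name[PROC_NUMBUF];
                int pid = 4242;

                /* Unlike a hand-rolled conversion, snprintf cannot overrun
                 * the buffer when the number is unexpectedly large. */
                snprintf(name, sizeof(name), "%d", pid);
                printf("/proc/%s\n", name);
                return 0;
        }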
     
  • Like the bug Oleg spotted in first_tid there was also a small off by one
    error in first_tgid, when a seek was done on the /proc directory. This
    fixes that and changes the code structure to make it a little more obvious
    what is going on.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Since we no longer need the tasklist_lock for get_task_struct the lookup
    methods no longer need the tasklist_lock.

    This just depends on my previous patch that makes get_task_struct() rcu
    safe.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • We don't need the tasklist_lock to safely iterate through processes
    anymore.

    This depends on my previous two task patches that make get_task_struct rcu
    safe, and that make next_task() rcu safe. I haven't gotten to
    first_tid/next_tid yet, only because next_thread is missing an
    rcu_dereference.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • There are a couple of problems this patch addresses.
    - /proc/<pid>/task currently does not work correctly if you stop reading
    in the middle of a directory.

    - /proc/ currently requires a full pass through the task list with
    the tasklist lock held, to determine there are no more processes to read.

    - The hand rolled integer to string conversion does not properly handle
    running out of buffer space.

    - We seem to be batching reading of pids from the tasklist without reason,
    and complicating the logic of the code.

    This patch addresses that by changing how tasks are processed. A
    first_<type> function is built that handles restarts, and a
    next_<type> function is built that just advances to the next task.

    first_<type>, when it detects a restart, usually uses find_task_by_pid. If
    that doesn't work because there has been a seek on the directory, or we have
    already given a complete directory listing, it first checks the number of
    tasks of that type, and only if we are under that count does it walk through
    all of the tasks to find the one we are interested in.

    The code that fills in the directory is simpler because there is only a single
    for loop.

    The hand rolled integer to string conversion is replaced by snprintf, which
    should handle the out-of-buffer case correctly.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • proc_lookup and task exiting are not synchronized, although some of the
    previous code may have suggested that. Every time before we reuse a dentry
    namei.c calls d_op->d_revalidate, which prevents us from reusing a stale dcache
    entry. Unfortunately it does not prevent us from returning a stale dcache
    entry. This race has been explicitly plugged in proc_pid_lookup but there is
    nothing to confine it to just that proc lookup function.

    So to prevent the race I call revalidate explicitly in all of the proc lookup
    functions after I call d_add, and report an error if the revalidate does not
    succeed.

    Years ago Al Viro did something similar but those changes got lost in the
    churn.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • To keep the dcache from filling up with dead /proc entries we flush them on
    process exit. However, over the years that code has gotten hairy, with a
    dentry pointer and a lock in task_struct, and has been misdocumented as a
    correctness feature.

    I have rewritten this code to look and see if we have a corresponding entry in
    the dcache and if so flush it on process exit. This removes the extra fields
    in the task_struct and allows me to trivially handle the case of a
    /proc/<pid>/task/<tid> entry as well as the current /proc/<pid> entries.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • All of the functions for proc_maps_operations are already defined in
    task_mmu.c so move the operations structure to keep the functionality
    together.

    Since task_nommu.c implements a dummy version of /proc/<pid>/maps, give it a
    simplified version of proc_maps_operations that it can modify to best suit its
    needs.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Use getattr to get an accurate link count when needed. This is cheaper and
    more accurate than trying to derive it by walking the thread list of a
    process.

    Especially as it happens only when needed (at stat time) instead of at
    readdir time.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Long ago and far away in 2.2 we started checking to ensure the files we
    displayed in /proc were visible to the current process. It was an
    unsophisticated time and no one was worried about functions full of FIXMES in
    a stable kernel. As time passed the function became sacred and was enshrined
    in the shrine of how things have always been. The fixes came in, but only to
    keep the function working; no one really remembered or documented why we did
    things that way.

    The intent and the functionality make a lot of sense: don't let /proc be an
    access point for files a process can see no other way. The implementation,
    however, is completely wrong.

    We are currently checking the root directories of the two processes; we are
    not checking the actual file descriptors themselves.

    We are strangely checking with a permission method instead of just when we use
    the data.

    This patch fixes the logic to actually check the file descriptors and make a
    note that implementing a permission method for this part of /proc almost
    certainly indicates a bug in the reasoning.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The inode operations only exist to support the proc_permission function.
    Currently mem_read and mem_write have all the same permission checks as
    ptrace. The fs check makes no sense in this context, and we can trivially get
    around it by calling ptrace.

    So simplify the code by killing this strange, weird case.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • First, we can access every /proc/<tgid>/task/<tid> directory as /proc/<tid>,
    so proc_task_permission is not usefully limiting visibility.

    Second, having related filesystems information should have nothing to do with
    process visibility; kill does not implement any checks like that.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The sole remaining use of proc_inode.type is to discover the file descriptor
    number, so just store the file descriptor number and don't worry about
    processing this field. This removes any /proc limits on the maximum number of
    file descriptors, and clears the path to make the hard coded /proc inode
    numbers go away.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Currently in /proc, if the task is dumpable, all of the files are owned by the
    task's effective user; otherwise the files are owned by root. Unless it is
    the /proc/<pid>/ or /proc/<pid>/task/<tid>/ directory, in which case we always
    make the directory owned by the effective user.

    However the special case for directories is pointless except as a way to read
    the effective user, because both of those directories are world readable
    and executable.

    /proc/<pid>/status provides a much better way to read a process's effective
    userid, so it is silly to try to provide that on the directory.

    So this patch simplifies the code by removing a pointless special case and
    gets us one step closer to being able to remove the hard coded /proc inode
    numbers.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The removed fields are already set by proc_alloc_inode. Initializing them in
    proc_alloc_inode implies they need it for proper cleanup. At least ei->pde
    was not set on all paths making it look like proc_alloc_inode was buggy. So
    just remove the redundant assignments.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • We already call everything except do_proc_readlink outside of the BKL in
    proc_pid_followlink, and there appears to be nothing in do_proc_readlink that
    needs any special protection.

    So remove this leftover from one of the BKL cleanup efforts.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Add a /proc/<pid>/attr/keycreate entry that stores the appropriate context for
    newly-created keys. Modify the selinux_key_alloc hook to make use of the new
    entry. Update the flask headers to include a new "setkeycreate" permission
    for processes. Update the flask headers to include a new "create" permission
    for keys. Use the create permission to restrict which SIDs each task can
    assign to newly-created keys. Add a new parameter to the security hook
    "security_key_alloc" to indicate whether it is being invoked by the kernel, or
    from userspace. If it is being invoked by the kernel, the security hook
    should never fail. Update the documentation to reflect these changes.

    Signed-off-by: Michael LeMay
    Signed-off-by: James Morris
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael LeMay
     

23 Jun, 2006

1 commit

  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
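
    For a simple filesystem, the calling convention described above — get_sb()
    receives the target vfsmount, returns an int, and either uses a get_sb_*()
    helper or calls simple_set_mnt() itself — ends up looking roughly like the
    sketch below. The foo_* names are placeholders (as in the adjusted
    documentation) and the fill_super body is elided:

        #include <linux/fs.h>
        #include <linux/module.h>

        static int foo_fill_super(struct super_block *sb, void *data, int silent)
        {
                /* set up sb->s_op, sb->s_root, ... */
                return 0;
        }

        /* New-style get_sb(): the VFS passes in the target vfsmount and the
         * method returns an integer instead of a superblock pointer. */
        static int foo_get_sb(struct file_system_type *fs_type, int flags,
                              const char *dev_name, void *data,
                              struct vfsmount *mnt)
        {
                /* The convenience helper now takes the vfsmount and fills in
                 * mnt->mnt_sb/mnt_root via simple_set_mnt(), so it can be
                 * tail-called. */
                return get_sb_single(fs_type, flags, data, foo_fill_super, mnt);
        }

        /* Registered with register_filesystem() at module init (not shown). */
        static struct file_system_type foo_fs_type = {
                .owner   = THIS_MODULE,
                .name    = "foo",
                .get_sb  = foo_get_sb,
                .kill_sb = kill_anon_super,
        };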