02 Dec, 2008

1 commit

  • It had been thought that the per-user file descriptor limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that lets a normal user request a pretty large
    amount of kernel memory, well within its maximum number of fds. To
    solve this problem, default limits are now imposed, and /proc based
    configuration has been introduced. A new directory named
    /proc/sys/fs/epoll/ has been created, and it contains two configuration
    points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by epoll
    to store "watches" to 1/32 of the amount of low RAM. As an example, a
    256MB 32-bit machine will have "max_user_watches" set to roughly 90000.
    That should be enough to not break existing heavy epoll users. The
    default value for "max_user_instances" is set to 128, which should be
    enough too.

    This also changes userspace-visible behavior, because a new error code
    can now be returned by EPOLL_CTL_ADD (-ENOSPC). The EMFILE from
    epoll_create() was already listed, so that should be ok.
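
    As a rough illustration of the userspace side, the sketch below reads one
    of the new limits from /proc and separates the new ENOSPC failure from
    other epoll_ctl() errors. The helper names (read_epoll_limit, add_watch,
    estimate_watches) are made up for this example, and the per-watch cost
    passed to estimate_watches() is an assumption: the figures quoted above
    (1/32 of 256MB, roughly 90000 watches) only imply something in the region
    of 90 bytes per watch.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/epoll.h>

/* Read one of the per-user epoll limits, e.g. "max_user_watches".
 * Returns the value, or -1 if the file is missing or unreadable
 * (for instance on a kernel that does not provide it). */
long read_epoll_limit(const char *name)
{
    char path[128];
    long value = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/sys/fs/epoll/%s", name);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Add fd to an epoll set and report the new failure mode separately.
 * Returns 0 on success, 1 if the per-user watch limit was hit (ENOSPC),
 * and -1 for any other error. */
int add_watch(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev = { .events = events, .data = { .fd = fd } };

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    return errno == ENOSPC ? 1 : -1;
}

/* Estimate from the text: 1/32 of low RAM, divided by an ASSUMED
 * per-watch cost in bytes. */
long long estimate_watches(long long low_ram_bytes, long long per_watch_cost)
{
    return low_ram_bytes / 32 / per_watch_cost;
}
```

    A caller that gets 1 back from add_watch() can either close some watched
    fds or ask the administrator to raise max_user_watches; retrying the same
    call unchanged will not help.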

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

27 Oct, 2008

1 commit

  • In commit f337b9c58332bdecde965b436e47ea4c94d30da0 ("epoll: drop
    unnecessary test") Thomas found that there is an unnecessary (always
    true) test in ep_send_events(). The callback never inserts into
    ->rdllink while the send loop is performed, and also does the
    ~EP_PRIVATE_BITS test. Given we're holding the mutex during this time,
    the conditions tested inside the loop are always true.

    HOWEVER.

    The test "!ep_is_linked(&epi->rdllink)" wasn't there because we insert
    into ->rdllink, but because the send-events loop might terminate before
    the whole list is scanned (-EFAULT).

    In such cases, when the loop terminates early, and when a (leftover)
    file received an event while we're performing the lockless loop, we need
    this test to avoid double-inserting the epoll items. The list_splice()
    done a few steps below will correctly re-insert the ones that were left
    on "txlist".

    This should fix the kernel.org bugzilla entry 11831.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

17 Oct, 2008

1 commit

  • Thomas found that there is an unnecessary (always true) test in
    ep_send_events(). The callback never inserts into ->rdllink while the
    send loop is performed, and also does the ~EP_PRIVATE_BITS test. Given
    we're holding the mutex during this time, the conditions tested inside the
    loop are always true. This patch drops the test done inside the
    re-insertion loop.

    Signed-off-by: Davide Libenzi
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

13 Aug, 2008

1 commit


25 Jul, 2008

4 commits

  • Remove the size parameter from the new epoll_create syscall and rename the
    syscall itself. The updated test program follows.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_epoll_create2
    # ifdef __x86_64__
    #  define __NR_epoll_create2 291
    # elif defined __i386__
    #  define __NR_epoll_create2 329
    # else
    #  error "need __NR_epoll_create2"
    # endif
    #endif

    #define EPOLL_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
      /* Without flags the new descriptor must not be close-on-exec.  */
      int fd = syscall (__NR_epoll_create2, 0);
      if (fd == -1)
        {
          puts ("epoll_create2(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("epoll_create2(0) set close-on-exec flag");
          return 1;
        }
      close (fd);

      /* With EPOLL_CLOEXEC the flag must be set.  */
      fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC);
      if (fd == -1)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) did not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds tests that ensure the boundary conditions for the various
    constants introduced in the previous patches are met. No code is generated.

    [akpm@linux-foundation.org: fix alpha]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds the new epoll_create2 syscall. It extends the old epoll_create
    syscall by one parameter which is meant to hold a flag value. In this
    patch the only flag supported is EPOLL_CLOEXEC, which causes the close-on-exec
    flag for the returned file descriptor to be set.

    A new name EPOLL_CLOEXEC is introduced which in this implementation must
    have the same value as O_CLOEXEC.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_epoll_create2
    # ifdef __x86_64__
    #  define __NR_epoll_create2 291
    # elif defined __i386__
    #  define __NR_epoll_create2 329
    # else
    #  error "need __NR_epoll_create2"
    # endif
    #endif

    #define EPOLL_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
      /* Without flags the new descriptor must not be close-on-exec.  */
      int fd = syscall (__NR_epoll_create2, 1, 0);
      if (fd == -1)
        {
          puts ("epoll_create2(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("epoll_create2(0) set close-on-exec flag");
          return 1;
        }
      close (fd);

      /* With EPOLL_CLOEXEC the flag must be set.  */
      fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
      if (fd == -1)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) did not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch just extends the anon_inode_getfd interface to take an additional
    parameter with a flag value. The flag value is passed on to
    get_unused_fd_flags in anticipation of its use with the O_CLOEXEC flag.

    No actual semantic changes here, the changed callers all pass 0 for now.

    [akpm@linux-foundation.org: KVM fix]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

02 May, 2008

1 commit

  • a) none of the callers even looks at inode or file returned by anon_inode_getfd()
    b) any caller that would try to look at those would be racy, since by the time
    it returns we might have raced with close() from another thread and that
    file would be pining for fjords.

    Signed-off-by: Al Viro

    Al Viro
     

30 Apr, 2008

2 commits

  • Change all the #ifdef TIF_RESTORE_SIGMASK conditionals in non-arch code to
    #ifdef HAVE_SET_RESTORE_SIGMASK. If arch code defines it first, the generic
    set_restore_sigmask() using TIF_RESTORE_SIGMASK is not defined.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This adds the set_restore_sigmask() inline function and
    replaces every set_thread_flag(TIF_RESTORE_SIGMASK) with a call to it. No
    functional change; it abstracts the details of the flag protocol away from
    all the callers.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

29 Apr, 2008

1 commit

  • Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but
    rb_set_parent() accesses node's pointer in its code. This creates a
    warning in kmemcheck (reported by Vegard Nossum) about an uninitialized
    memory access. The warning is harmless since the following rb-tree node
    insert is going to overwrite the node data. In any case I think it's
    better to not have that happening at all, and fix it by simplifying the
    code to get rid of a few lines that became superfluous after the previous
    epoll changes.

    Signed-off-by: Davide Libenzi
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

06 Feb, 2008

1 commit

  • On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:

    > I remember I talked with Arjan about this time ago. Basically, since 1)
    > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
    > are used, you can see a wake_up() from inside another wake_up(), but they
    > will never refer to the same lock instance.
    > Think about:
    >
    > dfd = socket(...);
    > efd1 = epoll_create();
    > efd2 = epoll_create();
    > epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
    > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
    >
    > When a packet arrives to the device underneath "dfd", the net code will
    > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
    > callback wakeup entry on that queue, and the wake_up() performed by the
    > "dfd" net code will end up in ep_poll_callback(). At this point epoll
    > (efd1) notices that it may have some event ready, so it needs to wake up
    > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
    > that ends up in another wake_up(), after having checked about the
    > recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to
    > avoid stack blasting. Never hit the same queue, to avoid loops like:
    >
    > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
    > epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
    > epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
    > epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
    >
    > The code "if (tncur->wq == wq || ..." prevents re-entering the same
    > queue/lock.

    Since the epoll code is very careful not to nest same-instance locks,
    allow the recursion.
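
    The nesting described in the quoted mail can be observed from userspace.
    The sketch below uses a pipe in place of the socket: it puts the pipe's
    read end into one epoll fd, puts that epoll fd into a second one, and
    checks that a write to the pipe is reported by the outer instance. The
    function names are hypothetical. The second helper shows the simplest
    loop being refused: adding an epoll fd to itself fails with EINVAL.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if data written to a pipe is reported by an outer epoll fd
 * that only watches an inner epoll fd, 0 if not, -1 on setup failure. */
int nested_epoll_wakeup(void)
{
    int p[2] = { -1, -1 }, efd1 = -1, efd2 = -1, result = -1;
    struct epoll_event ev = { .events = EPOLLIN }, got;

    if (pipe(p) == -1)
        return -1;
    efd1 = epoll_create(1);
    efd2 = epoll_create(1);
    if (efd1 == -1 || efd2 == -1)
        goto done;

    ev.data.fd = p[0];
    if (epoll_ctl(efd1, EPOLL_CTL_ADD, p[0], &ev) == -1)
        goto done;
    ev.data.fd = efd1;
    if (epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, &ev) == -1)
        goto done;

    if (write(p[1], "x", 1) != 1)
        goto done;
    /* The pipe wakeup makes efd1 ready, which in turn makes efd2 ready. */
    result = (epoll_wait(efd2, &got, 1, 1000) == 1 && got.data.fd == efd1);

done:
    close(p[0]);
    close(p[1]);
    if (efd1 != -1)
        close(efd1);
    if (efd2 != -1)
        close(efd2);
    return result;
}

/* Returns the errno produced by adding an epoll fd to itself,
 * 0 if the call unexpectedly succeeds, -1 on setup failure. */
int self_add_errno(void)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int efd = epoll_create(1), err = 0;

    if (efd == -1)
        return -1;
    ev.data.fd = efd;
    if (epoll_ctl(efd, EPOLL_CTL_ADD, efd, &ev) == -1)
        err = errno;
    close(efd);
    return err;
}
```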

    Signed-off-by: Peter Zijlstra
    Tested-by: Stefan Richter
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

07 Dec, 2007

1 commit


20 Oct, 2007

1 commit


19 Oct, 2007

1 commit

  • Get rid of sparse related warnings from places that use integer as NULL
    pointer.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Hemminger
    Cc: Andi Kleen
    Cc: Jeff Garzik
    Cc: Matt Mackall
    Cc: Ian Kent
    Cc: Arnd Bergmann
    Cc: Davide Libenzi
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

15 May, 2007

4 commits

  • Move the kfree() call inside the ep_free() function.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fixes some epoll code comments.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Changes the rwlock to a spinlock, and drops the use-count variable.
    Operations are always bound by the mutex now, so the use-count is no longer
    needed. For the same reason, the rwlock can become a simple spinlock.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fixes the epoll single pass code. During the unlocked event delivery (to
    userspace) code, the poll callback can re-issue new events, and we must
    receive them correctly. Since we loop in a lockless fashion, want to be
    O(nready), and don't want to take and release the spinlock for every event,
    we have the poll callback use a secondary list to queue events while we're
    inside the event delivery loop. The rw_semaphore has been turned into a
    mutex. This patch also adds the wait-exclusive flag, as suggested by Davi
    Arnaut.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

11 May, 2007

3 commits


09 May, 2007

3 commits

  • There are many places in the kernel where constructions like

    foo = list_entry(head->next, struct foo_struct, list);

    are used.
    The code might look more descriptive and neat if it used the macro

    #define list_first_entry(head, type, member) \
            list_entry((head)->next, type, member)

    Here is the macro itself and examples of its usage in the generic code.
    If it turns out to be useful, I can prepare a set of patches to
    inject it into arch-specific code, drivers, networking, etc.
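
    For readers without the kernel headers at hand, the macro can be tried in
    userspace with a minimal stand-in for the list primitives. Everything
    below except the two macro shapes quoted above is an assumption made for
    the sketch: list_entry() is rebuilt with offsetof(), and foo_struct,
    list_add_tail() and first_value() are illustrative only. As with the
    original macro, the caller must make sure the list is not empty.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Userspace stand-in: recover the containing structure from a member. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_first_entry(head, type, member) \
    list_entry((head)->next, type, member)

struct foo_struct {
    int value;
    struct list_head list;
};

/* Append item at the end of the circular list rooted at head. */
void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Value stored in the first element; the list must not be empty. */
int first_value(struct list_head *head)
{
    struct foo_struct *foo = list_first_entry(head, struct foo_struct, list);

    return foo->value;
}
```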

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: Randy Dunlap
    Cc: Andi Kleen
    Cc: Zach Brown
    Cc: Davide Libenzi
    Cc: John McCutchan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Ram Pai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • Remove includes of where it is not used/needed.
    Suggested by Al Viro.

    Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
    sparc64, and arm (all 59 defconfigs).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Epoll is doing multiple passes over the ready set at the moment, because of
    the constraints over the f_op->poll() call. Looking at the code again, I
    noticed that we already hold the epoll semaphore in read, and this
    (together with other locking conditions that hold while doing an
    epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
    (in a single pass).

    This is a stress application that can be used to test the new code. It
    spawns multiple threads and calls epoll_wait() and epoll_ctl() from many
    threads. Stress tested on my dual Opteron 254 without any problems.

    http://www.xmailserver.org/totalmess.c

    This is not a benchmark, just something that tries to stress and exploit
    possible problems with the new code.
    Also, I made a stupid micro-benchmark:

    http://www.xmailserver.org/epwbench.c

    [1] Considering that epoll must be thread-safe, there are five ways we can
    be hit during an epoll_wait() transfer loop (ep_send_events()):

    1) The epoll fd going away and calling ep_free
    This just can't happen, since we did an fget() in sys_epoll_wait

    2) An epoll_ctl(EPOLL_CTL_DEL)
    This can't happen because epoll_ctl() gets ep->sem in write, and
    we're holding it in read during ep_send_events()

    3) An fd stored inside the epoll fd going away
    This can't happen because in eventpoll_release_file() we get
    ep->sem in write, and we're holding it in read during
    ep_send_events()

    4) Another epoll_wait() happening on another thread
    They both can be inside ep_send_events() at the same time, we get
    (splice) the ready-list under the spinlock, so each one will get
    its own ready list. Note that an fd cannot be at the same time
    inside more than one ready list, because ep_poll_callback() will
    not re-queue it if it sees it already linked:

    if (ep_is_linked(&epi->rdllink))
    goto is_linked;

    Another case that can happen, is two concurrent epoll_wait(),
    coming in with a userspace event buffer of size, say, ten.
    Suppose there are 50 events ready in the list. The first
    epoll_wait() will "steal" the whole list, while the second, seeing
    no events, will go to sleep. But at the end of ep_send_events() in
    the first epoll_wait(), we will re-inject surplus ready fds, and we
    will trigger the proper wake_up to the second epoll_wait().

    5) ep_poll_callback() hitting us asynchronously
    This is the tricky part. As I said above, the ep_is_linked() test
    done inside ep_poll_callback() guarantees that as long as the
    item appears linked to a list, ep_poll_callback() will not try
    to re-queue it (read or write data on any of its members). When
    we do a list_del() in ep_send_events(), the item will still satisfy
    the ep_is_linked() test (whatever data is written in prev/next,
    it'll never be its own pointer), so ep_poll_callback() will still
    leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
    that it'll become visible to ep_poll_callback(), but at that point
    we're already past it.
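
    The argument in case 5 rests on what "linked" means for a list node. The
    single-threaded sketch below models it under the assumption that the test
    is simply "next does not point back at the node itself". The names are
    hypothetical userspace stand-ins for the kernel list helpers, NULL stands
    in for whatever values list_del() leaves in prev/next, and the memory
    barrier is not modelled.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Make the node point at itself: this is the "not linked" state. */
void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* A node counts as linked unless next points back at the node itself. */
int is_linked(const struct list_head *p)
{
    return p->next != p;
}

void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Unlink the node but leave non-self pointers behind, so it still
 * looks linked until init_list_head() is called on it. */
void list_del_plain(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = NULL;
    entry->prev = NULL;
}
```

    Between list_del_plain() and the later init_list_head(), is_linked()
    still reports the item as linked, which is the window in which the
    callback leaves the item alone.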

    [akpm@osdl.org: 80 cols]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.
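
    The effect of the transition defines can be sketched outside the kernel.
    The structures below are heavily reduced stand-ins, and the two defines
    are assumed to be plain aliases of the form the text describes
    (f_dentry for f_path.dentry, f_vfsmnt for f_path.mnt). The accessor
    functions are hypothetical and exist only to show that old-style and
    new-style code reach the same field.

```c
/* Reduced stand-ins for the kernel structures. */
struct dentry {
    const char *name;
};

struct vfsmount {
    int id;
};

struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

struct file {
    struct path f_path;
};

/* Assumed shape of the transition aliases. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt

/* Old-style access keeps compiling through the alias. */
const char *file_name_old(const struct file *f)
{
    return f->f_dentry->name;
}

/* New-style access goes through the embedded struct path. */
const char *file_name_new(const struct file *f)
{
    return f->f_path.dentry->name;
}
```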

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     

08 Dec, 2006

2 commits

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
    quilt add $file
    sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
    mv /tmp/$$ $file
    quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Oct, 2006

1 commit

  • Implement the epoll_pwait system call, which extends the event wait mechanism
    with the same logic ppoll and pselect use. The definition of epoll_pwait
    is:

    int epoll_pwait(int epfd, struct epoll_event *events, int maxevents,
    int timeout, const sigset_t *sigmask, size_t sigsetsize);

    The difference between the vanilla epoll_wait and epoll_pwait is that the
    latter allows the caller to specify a signal mask to be set while waiting
    for events. Hence epoll_pwait will wait until either a monitored event
    or an unmasked signal happens. If sigmask is NULL, the epoll_pwait system
    call will act exactly like epoll_wait. For the POSIX definition of
    pselect, information is available here:

    http://www.opengroup.org/onlinepubs/009695399/functions/select.html
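
    A minimal usage sketch, assuming the glibc wrapper, which takes five
    arguments and supplies sigsetsize itself. The helper name is hypothetical.
    It waits for events with SIGINT blocked for the duration of the call, so
    a pending SIGINT is only delivered once the original mask is restored on
    return.

```c
#include <signal.h>
#include <sys/epoll.h>

/* Wait on epfd with SIGINT blocked for the duration of the wait.
 * Returns whatever epoll_pwait() returns: the number of ready events,
 * 0 on timeout, or -1 with errno set. */
int wait_blocking_sigint(int epfd, struct epoll_event *events,
                         int maxevents, int timeout_ms)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    /* The mask replaces the caller's signal mask only while waiting. */
    return epoll_pwait(epfd, events, maxevents, timeout_ms, &mask);
}
```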

    Signed-off-by: Davide Libenzi
    Cc: David Woodhouse
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

03 Oct, 2006

1 commit


27 Sep, 2006

1 commit

  • This eliminates the i_blksize field from struct inode. Filesystems that want
    to provide a per-inode st_blksize can do so by providing their own getattr
    routine instead of using the generic_fillattr() function.

    Note that some filesystems were providing pretty much random (and incorrect)
    values for i_blksize.

    [bunk@stusta.de: cleanup]
    [akpm@osdl.org: generic_fillattr() fix]
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

28 Aug, 2006

1 commit


04 Jul, 2006

1 commit


26 Jun, 2006

1 commit

  • A few days ago Arjan signaled a lockdep red flag on epoll locks,
    precisely between the epoll's device structure lock (->lock) and the wait
    queue head lock (->lock).

    As I explained in another email, and directly to Arjan, this can't happen
    in reality because of the explicit check at eventpoll.c:592, which does not
    allow dropping an epoll fd inside the same epoll fd. Since lockdep
    works on per-structure locks, it will never be able to know of policies
    enforced in other parts of the code.

    It was decided some time ago to allow dropping epoll fds inside
    other epoll fds, which triggers very tricky wakeup operations (due to
    possibly reentrant callback-driven wakeups) handled by the
    ep_poll_safewake() function. While looking again at the code though, I
    noticed that all the operations done on the epoll's main structure wait
    queue head (->wq) are already protected by the epoll lock (->lock), so that
    locked-style functions can be used to manipulate the ->wq member. This
    both saves a lock acquisition and makes lockdep happy.

    Running totalmess on my dual opteron for a while did not reveal any problem
    so far:

    http://www.xmailserver.org/totalmess.c

    Signed-off-by: Davide Libenzi
    Cc: Arjan van de Ven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

23 Jun, 2006

1 commit

  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

21 Apr, 2006

1 commit


11 Apr, 2006

1 commit


29 Mar, 2006

1 commit

  • This is a conversion to make the various file_operations structs in fs/
    const. Basically a regexp job, with a few manual fixups.

    The goal is both to increase correctness (harder to accidentally write to
    shared data structures) and to reduce the false sharing of cachelines with
    things that get dirty in .data (while .rodata is nicely read-only and thus
    cache clean).

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven