20 Jul, 2007

1 commit

  • Slab destructors are no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They trigger BUG()s
    for both slab and slub, and slob never supported them either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).
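
    For illustration, a before/after sketch of a kmem_cache_create() call
    (the foo names are illustrative, and the exact ctor type of this era
    may differ in detail):

    /* before: a destructor pointer could be passed (almost always NULL) */
    cachep = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                               SLAB_HWCACHE_ALIGN, foo_ctor, NULL);

    /* after: the dtor argument is gone */
    cachep = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                               SLAB_HWCACHE_ALIGN, foo_ctor);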

    Signed-off-by: Paul Mundt

    Paul Mundt
     

15 May, 2007

4 commits

  • Move the kfree() call inside the ep_free() function.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fixes some epoll code comments.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Changes the rwlock to a spinlock, and drops the use-count variable.
    Operations are always bound by the mutex now, so the use-count is no
    longer needed. For the same reason, the rwlock can become a simple
    spinlock.
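
    The conversion amounts to a pattern like this sketch (field and
    variable names illustrative):

    /* before */
    rwlock_t lock;
    read_lock_irqsave(&ep->lock, flags);
    /* ... touch the ready list ... */
    read_unlock_irqrestore(&ep->lock, flags);

    /* after */
    spinlock_t lock;
    spin_lock_irqsave(&ep->lock, flags);
    /* ... touch the ready list ... */
    spin_unlock_irqrestore(&ep->lock, flags);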

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fixes the epoll single pass code. During the unlocked event delivery (to
    userspace) code, the poll callback can re-issue new events, and we must
    receive them correctly. Since we loop in a lockless fashion, we want to
    be O(nready), and we don't want to toggle the spinlock on and off for
    every event, so we have the poll callback use a secondary list to queue
    events while we're inside the event delivery loop. The rw_semaphore has
    been turned into a mutex. This patch also adds the wait-exclusive flag,
    as suggested by Davi Arnaut.
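
    Conceptually, the secondary-list trick looks like this sketch (the
    delivery_in_progress and ovf_list names are illustrative, not the
    actual field names):

    /* sketch of ep_poll_callback(), with ep->lock held */
    if (ep->delivery_in_progress) {
            /* the delivery loop runs unlocked: park the item on a
             * secondary list, spliced back once the loop finishes */
            list_add_tail(&epi->rdllink, &ep->ovf_list);
    } else {
            list_add_tail(&epi->rdllink, &ep->rdllist);
    }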

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

11 May, 2007

3 commits


09 May, 2007

3 commits

  • There are many places in the kernel where a construction like

    foo = list_entry(head->next, struct foo_struct, list);

    is used.
    The code might look more descriptive and neat using the macro

    #define list_first_entry(head, type, member) \
            list_entry((head)->next, type, member)

    Here is the macro itself and examples of its usage in the generic code.
    If it turns out to be useful, I can prepare a set of patches to inject
    it into arch-specific code, drivers, networking, etc.
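
    Usage then becomes, for example:

    foo = list_first_entry(head, struct foo_struct, list);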

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: Randy Dunlap
    Cc: Andi Kleen
    Cc: Zach Brown
    Cc: Davide Libenzi
    Cc: John McCutchan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Ram Pai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • Remove includes of <linux/smp_lock.h> where it is not used/needed.
    Suggested by Al Viro.

    Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
    sparc64, and arm (all 59 defconfigs).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Epoll is doing multiple passes over the ready set at the moment, because of
    the constraints over the f_op->poll() call. Looking at the code again, I
    noticed that we already hold the epoll semaphore in read, and this
    (together with other locking conditions that hold while doing an
    epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
    (in a single pass).

    This is a stress application that can be used to test the new code. It
    spawns multiple threads and calls epoll_wait() and epoll_ctl() from
    many threads. Stress tested on my dual Opteron 254 without any
    problems.

    http://www.xmailserver.org/totalmess.c

    This is not a benchmark, just something that tries to stress and exploit
    possible problems with the new code.
    Also, I made a stupid micro-benchmark:

    http://www.xmailserver.org/epwbench.c

    [1] Considering that epoll must be thread-safe, there are five ways we can
    be hit during an epoll_wait() transfer loop (ep_send_events()):

    1) The epoll fd going away and calling ep_free
    This just can't happen, since we did an fget() in sys_epoll_wait

    2) An epoll_ctl(EPOLL_CTL_DEL)
    This can't happen because epoll_ctl() gets ep->sem in write, and
    we're holding it in read during ep_send_events()

    3) An fd stored inside the epoll fd going away
    This can't happen because in eventpoll_release_file() we get
    ep->sem in write, and we're holding it in read during
    ep_send_events()

    4) Another epoll_wait() happening on another thread
    They both can be inside ep_send_events() at the same time, we get
    (splice) the ready-list under the spinlock, so each one will get
    its own ready list. Note that an fd cannot be at the same time
    inside more than one ready list, because ep_poll_callback() will
    not re-queue it if it sees it already linked:

    if (ep_is_linked(&epi->rdllink))
            goto is_linked;

    Another case that can happen is two concurrent epoll_wait() calls,
    coming in with a userspace event buffer of size, say, ten. Suppose
    there are 50 events ready in the list. The first epoll_wait() will
    "steal" the whole list, while the second, seeing no events, will go to
    sleep. But at the end of ep_send_events() in the first epoll_wait(),
    we will re-inject surplus ready fds, and we will trigger the proper
    wake_up for the second epoll_wait().

    5) ep_poll_callback() hitting us asynchronously
    This is the tricky part. As I said above, the ep_is_linked() test done
    inside ep_poll_callback() guarantees that, as long as the item appears
    linked to a list, ep_poll_callback() will not try to re-queue it
    (read: write data on any of its members). When we do a list_del() in
    ep_send_events(), the item will still satisfy the ep_is_linked() test
    (whatever data is written in prev/next, it'll never be its own
    pointer), so ep_poll_callback() will still leave us alone. It's only
    after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink) that it
    becomes visible to ep_poll_callback(), but at that point we're already
    past it.
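
    Put together, the single-pass transfer boils down to a sketch like
    this (txlist is illustrative; error paths omitted):

    LIST_HEAD(txlist);

    spin_lock_irqsave(&ep->lock, flags);
    list_splice_init(&ep->rdllist, &txlist);    /* steal the ready list */
    spin_unlock_irqrestore(&ep->lock, flags);

    /* walk txlist unlocked, copying events to userspace; the items
     * still look linked, so ep_poll_callback() leaves them alone */

    spin_lock_irqsave(&ep->lock, flags);
    list_splice(&txlist, &ep->rdllist);         /* re-inject surplus fds */
    if (!list_empty(&ep->rdllist))
            wake_up(&ep->wq);                   /* wake another waiter */
    spin_unlock_irqrestore(&ep->lock, flags);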

    [akpm@osdl.org: 80 cols]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #defines to make the transition easier for
    users of f_dentry and f_vfsmnt.
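
    The compatibility defines are presumably along these lines:

    #define f_dentry        f_path.dentry
    #define f_vfsmnt        f_path.mnt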

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     

08 Dec, 2006

2 commits

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h" | xargs grep -l $1`; do
            quilt add $file
            sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
            mv /tmp/$$ $file
            quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"
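
    The net effect on a typical declaration:

    /* before */
    static kmem_cache_t *epi_cache;

    /* after */
    static struct kmem_cache *epi_cache;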

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.
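
    So callers can use the GFP_ name directly, e.g.:

    /* before */
    epi = kmem_cache_alloc(epi_cache, SLAB_KERNEL);

    /* after */
    epi = kmem_cache_alloc(epi_cache, GFP_KERNEL);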

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Oct, 2006

1 commit

  • Implement the epoll_pwait system call, which extends the event wait
    mechanism with the same logic as ppoll and pselect. The definition of
    epoll_pwait is:

    int epoll_pwait(int epfd, struct epoll_event *events, int maxevents,
                    int timeout, const sigset_t *sigmask,
                    size_t sigsetsize);

    The difference between the vanilla epoll_wait and epoll_pwait is that
    the latter allows the caller to specify a signal mask to be set while
    waiting for events. Hence epoll_pwait will wait until either a
    monitored event occurs or an unmasked signal arrives. If sigmask is
    NULL, the epoll_pwait system call will act exactly like epoll_wait.
    For the POSIX definition of pselect, information is available here:

    http://www.opengroup.org/onlinepubs/009695399/functions/select.html
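
    A minimal usage sketch (using the glibc-style wrapper, which hides the
    trailing sigsetsize argument; error handling omitted):

    sigset_t sigmask;

    sigemptyset(&sigmask);
    sigaddset(&sigmask, SIGINT);    /* keep SIGINT blocked while waiting */

    nfds = epoll_pwait(epfd, events, maxevents, -1, &sigmask);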

    Signed-off-by: Davide Libenzi
    Cc: David Woodhouse
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

03 Oct, 2006

1 commit


27 Sep, 2006

1 commit

  • This eliminates the i_blksize field from struct inode. Filesystems that want
    to provide a per-inode st_blksize can do so by providing their own getattr
    routine instead of using the generic_fillattr() function.

    Note that some filesystems were providing pretty much random (and incorrect)
    values for i_blksize.
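
    A filesystem wanting a per-inode st_blksize would now do something
    like this (foo_* names illustrative; the getattr signature is the one
    from this era):

    static int foo_getattr(struct vfsmount *mnt, struct dentry *dentry,
                           struct kstat *stat)
    {
            generic_fillattr(dentry->d_inode, stat);
            /* foo_blksize() is a hypothetical per-inode helper */
            stat->blksize = foo_blksize(dentry->d_inode);
            return 0;
    }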

    [bunk@stusta.de: cleanup]
    [akpm@osdl.org: generic_fillattr() fix]
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

28 Aug, 2006

1 commit


04 Jul, 2006

1 commit


26 Jun, 2006

1 commit

  • A few days ago Arjan raised a lockdep red flag on epoll locks,
    specifically between the epoll device structure lock (->lock) and the
    wait queue head lock (->lock).

    As I explained in another email, and directly to Arjan, this can't
    happen in reality because of the explicit check at eventpoll.c:592,
    which does not allow dropping an epoll fd inside the same epoll fd.
    Since lockdep works on per-structure locks, it can never know about
    policies enforced in other parts of the code.

    It was decided some time ago to allow dropping epoll fds inside other
    epoll fds, which triggers very tricky wakeup operations (due to
    possibly reentrant callback-driven wakeups) handled by the
    ep_poll_safewake() function. While looking at the code again, though,
    I noticed that all the operations done on the epoll main structure's
    wait queue head (->wq) are already protected by the epoll lock
    (->lock), so locked-style functions can be used to manipulate the ->wq
    member. This both saves a lock acquisition and makes lockdep happy.

    Running totalmess on my dual opteron for a while did not reveal any problem
    so far:

    http://www.xmailserver.org/totalmess.c
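
    The change boils down to a pattern like this sketch (the exact
    task-state flags may differ):

    spin_lock_irqsave(&ep->lock, flags);
    /* ... manipulate the ready list ... */
    if (waitqueue_active(&ep->wq))
            __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE |
                             TASK_INTERRUPTIBLE);
    spin_unlock_irqrestore(&ep->lock, flags);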

    Signed-off-by: Davide Libenzi
    Cc: Arjan van de Ven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

23 Jun, 2006

1 commit

  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's no longer any need
    to return the superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees, which results
    in dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including changing ext2_* into
    foo_* in the examples.
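
    For a typical filesystem, the converted get_sb() thus reduces to
    something like this (foo_* names illustrative, as in the
    documentation):

    static int foo_get_sb(struct file_system_type *fs_type, int flags,
                          const char *dev_name, void *data,
                          struct vfsmount *mnt)
    {
            return get_sb_nodev(fs_type, flags, data, foo_fill_super,
                                mnt);
    }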

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

21 Apr, 2006

1 commit


11 Apr, 2006

1 commit


29 Mar, 2006

1 commit

  • This is a conversion to make the various file_operations structs in fs/
    const. Basically a regexp job, with a few manual fixups.

    The goal is both to increase correctness (making it harder to
    accidentally write to shared data structures) and to reduce false
    sharing of cache lines with things that get dirty in .data (while
    .rodata is nicely read-only and thus cache-clean).
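
    In eventpoll.c, for instance, the declaration becomes something like:

    static const struct file_operations eventpoll_fops = {
            .release = ep_eventpoll_close,
            .poll    = ep_eventpoll_poll
    };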

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

27 Mar, 2006

1 commit

  • While hunting with oprofile on an SMP platform, I discovered that
    dentry lookups were slowed down because d_hash_mask, d_hash_shift and
    dentry_hashtable were in a cache line that contained inodes_stat. So
    each time inodes_stat is changed by a CPU, other CPUs have to refill
    their cache line.

    This patch moves some variables to the __read_mostly section, in order to
    avoid false sharing. RCU dentry lookups can go full speed.
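
    The annotation itself is a one-liner per variable, e.g.:

    static unsigned int d_hash_mask __read_mostly;
    static unsigned int d_hash_shift __read_mostly;
    static struct hlist_head *dentry_hashtable __read_mostly;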

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

26 Mar, 2006

1 commit

  • Implement the half-closed device notification, by adding a new
    POLLRDHUP (and its alias EPOLLRDHUP) bit to the existing poll/select
    sets. Since there was concern about changing the existing POLLHUP
    handling, which does not correctly report half-closed devices, this
    implementation leaves the current POLLHUP reporting unchanged and
    simply adds a new bit that is set in the few places where it makes
    sense. The same thing was discussed and conceptually agreed quite some
    time ago:

    http://lkml.org/lkml/2003/7/12/116

    Since this new event bit is added to the existing Linux poll
    infrastructure, even the existing poll/select system calls will be
    able to use it. As for the existing POLLHUP handling, the patch leaves
    it as is. The pollrdhup-2.6.16.rc5-0.10.diff defines POLLRDHUP for all
    the existing archs and sets the bit in the six relevant files. The
    other attached diff is the simple change required to sys/epoll.h to
    add the EPOLLRDHUP definition.

    There is "a stupid program" to test POLLRDHUP delivery here:

    http://www.xmailserver.org/pollrdhup-test.c

    It tests poll(2), but since the delivery is the same, epoll(2) will
    work equally well.
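
    From userspace, checking for a half-close looks like this sketch
    (POLLRDHUP typically requires _GNU_SOURCE):

    struct pollfd pfd = { .fd = sock, .events = POLLIN | POLLRDHUP };

    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLRDHUP)) {
            /* peer shut down its write side: drain any remaining
             * data, then close */
    }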

    Signed-off-by: Davide Libenzi
    Cc: "David S. Miller"
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

23 Mar, 2006

2 commits

  • Eliminate a handful of cache references by keeping current in a register
    instead of reloading (helps x86) and avoiding the overhead of a function
    call. Inlining eventpoll_init_file() saves 24 bytes. Also reorder file
    initialization to make writes occur more sequentially.

    Signed-off-by: Benjamin LaHaise
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin LaHaise
     
  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.
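
    The mechanical part of such a conversion is essentially:

    /* before */
    struct semaphore lock;
    init_MUTEX(&lock);
    down(&lock);
    up(&lock);

    /* after */
    struct mutex lock;
    mutex_init(&lock);
    mutex_lock(&lock);
    mutex_unlock(&lock);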

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

28 Sep, 2005

1 commit


18 Sep, 2005

1 commit

  • Al found a potential problem in epoll_create(), where the
    file->private_data member was set after fd_install(). This is obviously
    wrong since another thread might do a close() on that fd# before we set the
    file->private_data member. This goes over 2.6.13 and passes a few basic
    tests I've done here.

    (akpm: snuck in a kzalloc() cleanup too)
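
    The fix is an ordering issue; conceptually:

    /* wrong: the fd is live before the file is fully set up */
    fd_install(fd, file);
    file->private_data = ep;

    /* right: publish the fd only once private_data is set */
    file->private_data = ep;
    fd_install(fd, file);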

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

24 Jun, 2005

1 commit

  • This patch gets rid of some macro obfuscation from fs/eventpoll.c by
    removing slab allocator wrappers and converting macros to static inline
    functions.
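
    The pattern, with a hypothetical example:

    /* before: macro obfuscation */
    #define FOO_FIRST(h) list_entry((h)->next, struct foo, list)

    /* after: a type-checked static inline */
    static inline struct foo *foo_first(struct list_head *h)
    {
            return list_entry(h->next, struct foo, list);
    }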

    Signed-off-by: Pekka Enberg
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

06 May, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds