14 Jul, 2005

1 commit

  • Something has changed in the core kernel such that we now get concurrent
    inode write outs, one e.g. via pdflush and one via sys_sync or whatever.
    This causes a nasty deadlock in ntfs. The only clean solution
    unfortunately requires a minor vfs api extension.

    First the deadlock analysis:

    Prerequisite knowledge: NTFS has a file $MFT (inode 0) loaded at mount
    time. The NTFS driver uses the page cache for storing the file contents as
    usual. More interestingly this file contains the table of on-disk inodes
    as a sequence of MFT_RECORDs. Thus the NTFS driver accesses the on-disk
    inodes by accessing the MFT_RECORDs in the page cache pages of the loaded
    inode $MFT.

    The situation: VFS inode X on a mounted ntfs volume is dirty. For the same
    inode X, the ntfs_inode is dirty, and thus so is the corresponding on-disk
    inode, which, as explained above, sits in a dirty PAGE_CACHE_PAGE belonging
    to the table of inodes ($MFT, inode 0).

    What happens:

    Process 1: sys_sync()/umount()/whatever... calls __sync_single_inode() for
    $MFT -> do_writepages() -> write_page for the dirty page containing the
    on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
    clears PageUptodate() on the page to prevent anyone else getting hold of it
    whilst it does the write out (this is necessary as the on-disk inode needs
    "fixups" applied before the write to disk which are removed again after the
    write and PageUptodate is then set again). It then analyses the page
    looking for dirty on-disk inodes and when it finds one it calls
    ntfs_may_write_mft_record() to see if it is safe to write this on-disk
    inode. This then calls ilookup5() to check if the corresponding VFS inode
    is in the icache. This in turn calls ifind() which waits on the inode lock
    via wait_on_inode() whilst holding the global inode_lock.

    Process 2: pdflush results in a call to __sync_single_inode for the same
    VFS inode X on the ntfs volume. This locks the inode (I_LOCK) then calls
    write-inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
    the page (in page cache of table of inodes $MFT, inode 0) containing the
    on-disk inode. This page has PageUptodate() clear because of Process 1
    (see above) so read_cache_page() blocks when it tries to take the page
    lock for the page so it can call ntfs_readpage().

    Thus Process 1 is holding the page lock on the page containing the on-disk
    inode X and it is waiting on the inode X to be unlocked in ifind() so it
    can write the page out and then unlock the page.

    And Process 2 is holding the inode lock on inode X and is waiting for the
    page to be unlocked so it can call ntfs_readpage() or discover that
    Process 1 set PageUptodate() again and use the page.

    Thus we have a deadlock due to ifind() waiting on the inode lock.

    The only sensible solution: NTFS does not care whether the VFS inode is
    locked or not when it calls ilookup5() (it doesn't use the VFS inode at
    all, it just uses it to find the corresponding ntfs_inode which is of
    course attached to the VFS inode (both are one single struct); and it uses
    the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
    hence we want a modified ilookup5_nowait() which is the same as ilookup5()
    but it does not wait on the inode lock.
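
    The difference between the two lookups can be sketched in a small
    userspace model. This is not the fs/inode.c code: the struct, the
    MODEL_I_LOCK bit, the enum and the function names are all hypothetical
    stand-ins, and the waiting variant reports that it would block instead of
    actually sleeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
#define MODEL_I_LOCK 0x1u

struct model_inode {
    unsigned long ino;
    unsigned int state;   /* MODEL_I_LOCK is set while a write-out holds the inode */
};

enum lookup_result { LOOKUP_FOUND, LOOKUP_MISSING, LOOKUP_WOULD_BLOCK };

/* Linear search standing in for the icache hash lookup. */
struct model_inode *find_in_cache(struct model_inode *cache, size_t n,
                                  unsigned long ino)
{
    assert(cache != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cache[i].ino == ino)
            return &cache[i];
    return NULL;
}

/* Models the waiting lookup: a locked inode makes the caller wait.
 * The model returns LOOKUP_WOULD_BLOCK rather than sleeping. */
enum lookup_result model_lookup(struct model_inode *cache, size_t n,
                                unsigned long ino, struct model_inode **out)
{
    struct model_inode *inode = find_in_cache(cache, n, ino);

    *out = NULL;
    if (inode == NULL)
        return LOOKUP_MISSING;
    if (inode->state & MODEL_I_LOCK)
        return LOOKUP_WOULD_BLOCK;
    *out = inode;
    return LOOKUP_FOUND;
}

/* Models the nowait lookup: same search, but the lock bit is ignored. */
enum lookup_result model_lookup_nowait(struct model_inode *cache, size_t n,
                                       unsigned long ino,
                                       struct model_inode **out)
{
    *out = find_in_cache(cache, n, ino);
    return *out ? LOOKUP_FOUND : LOOKUP_MISSING;
}
```

    In the model, a caller that only needs to locate the inode (as NTFS does
    to reach its ntfs_inode) gets it back from model_lookup_nowait() even
    while the lock bit is set, which is the property that breaks the cycle
    described above.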

    Without such functionality I would have to keep my own ntfs_inode cache in
    the NTFS driver just so I can find ntfs_inodes independent of their VFS
    inodes which would be slow, memory and cpu cycle wasting, and incredibly
    stupid given the icache already exists in the VFS.

    Below is a patch that does the ilookup5_nowait() implementation in
    fs/inode.c and exports it.

    ilookup5_nowait.diff:

    Introduce ilookup5_nowait() which is basically the same as ilookup5() but
    it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
    done in ifind()).

    This is needed to avoid a nasty deadlock in NTFS.

    Signed-off-by: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
     

13 Jul, 2005

1 commit

  • inotify is intended to correct the deficiencies of dnotify, particularly
    its inability to scale and its terrible user interface:

    * dnotify requires the opening of one fd per directory
      that you intend to watch. This quickly results in too many
      open files and pins removable media, preventing unmount.
    * dnotify is directory-based. You only learn about changes to
      directories. Sure, a change to a file in a directory affects
      the directory, but you are then forced to keep a cache of
      stat structures.
    * dnotify's interface to user-space is awful. Signals?

    inotify provides a more usable, simple, powerful solution to file change
    notification:

    * inotify's interface is a system call that returns a fd, not SIGIO.
      You get a single fd, which is select()-able.
    * inotify has an event that says "the filesystem that the item
      you were watching is on was unmounted."
    * inotify can watch directories or files.

    Inotify is currently used by Beagle (a desktop search infrastructure),
    Gamin (a FAM replacement), and other projects.

    See Documentation/filesystems/inotify.txt.

    Signed-off-by: Robert Love
    Cc: John McCutchan
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Love
     

08 Jul, 2005

1 commit

  • OCFS2 wants to mark an inode which has been orphaned by another node so
    that during final iput it takes the correct path through the VFS and can
    pass through the OCFS2 delete_inode callback. Since i_nlink can get out of
    date with other nodes, the best way I see to accomplish this is by clearing
    i_nlink on those inodes at drop_inode time. Other than this small amount
    of work, nothing different needs to happen, so I think it would be cleanest
    to be able to just call generic_drop_inode at the end of the OCFS2
    drop_inode callback.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

28 Jun, 2005

1 commit

  • This updates the CFQ io scheduler to the new time sliced design (cfq
    v3). It provides full process fairness, while giving excellent
    aggregate system throughput even for many competing processes. It
    supports io priorities, either inherited from the cpu nice value or set
    directly with the ioprio_get/set syscalls. The latter closely mimic
    set/getpriority.
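
    A minimal userspace sketch of the ioprio syscalls, assuming a Linux
    system where SYS_ioprio_set and SYS_ioprio_get are defined in
    <sys/syscall.h>. The DEMO_ constants and helper names are hypothetical;
    their values mirror the kernel's priority encoding (class in the bits
    above bit 13, level in the low bits) and should be checked against the
    kernel headers in use.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copies of the kernel's encoding; names are hypothetical. */
#define DEMO_IOPRIO_CLASS_SHIFT 13
enum { DEMO_CLASS_NONE, DEMO_CLASS_RT, DEMO_CLASS_BE, DEMO_CLASS_IDLE };
enum { DEMO_WHO_PROCESS = 1 };

/* Pack a scheduling class and a level (0 = highest, 7 = lowest). */
int demo_ioprio_value(int cls, int level)
{
    assert(level >= 0 && level < 8);
    return (cls << DEMO_IOPRIO_CLASS_SHIFT) | level;
}

int demo_ioprio_class(int value)
{
    return value >> DEMO_IOPRIO_CLASS_SHIFT;
}

int demo_ioprio_level(int value)
{
    return value & ((1 << DEMO_IOPRIO_CLASS_SHIFT) - 1);
}

/* Set the I/O priority of the calling process (pid 0 means "self").
 * Returns 0 on success, -1 with errno set on failure. */
int demo_set_own_ioprio(int cls, int level)
{
    return (int)syscall(SYS_ioprio_set, DEMO_WHO_PROCESS, 0,
                        demo_ioprio_value(cls, level));
}

/* Fetch the packed I/O priority of the calling process, -1 on failure. */
int demo_get_own_ioprio(void)
{
    return (int)syscall(SYS_ioprio_get, DEMO_WHO_PROCESS, 0);
}
```

    The (which, who, value) argument shape is what makes these calls
    resemble setpriority() and getpriority(). A caller might use
    demo_set_own_ioprio(DEMO_CLASS_BE, 4) and then decode the result of
    demo_get_own_ioprio() with the two accessor helpers.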

    This import is based on my latest from -mm.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Jun, 2005

8 commits

  • This patch reworks filemap_xip.c with the goal to reduce code duplication
    from mm/filemap.c. It applies against 2.6.12-rc6-mm1. Instead of
    implementing the aio functions, this one implements the synchronous
    read/write functions only. For readv and writev, the generic fallback is
    used. For aio, we rely on the application doing the fallback. Since our
    "synchronous" function does memcpy immediately anyway, there is no
    performance difference between using the fallbacks or implementing each
    operation.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • These are the ext2 related parts. Ext2 now uses the xip_* file operations
    along with the get_xip_page aop when mounted with -o xip.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • - generic_file* file operations no longer have a xip/non-xip split
    - filemap_xip.c implements a new set of fops that require the
      get_xip_page aop to work properly. All new fops are exported GPL-only
      (I don't want to see any code other than GPL modules using them)
    - __xip_unmap now uses page_check_address, which is no longer static
      in rmap.c, and defined in linux/rmap.h
    - mm/filemap.h is now much cleaner, plainly having just Linus'
      inline funcs moved here from filemap.c
    - fix includes in filemap_xip to make it build cleanly on i386

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • This is the block device related part. The block device operation
    direct_access now has a struct block_device as first parameter.

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • XFS will have to look at iocb->private to fix aio+dio. No other filesystem
    is using the blockdev_direct_IO* end_io callback.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The following patch removes the f_error field and all checks of f_error.

    Trond said:

    f_error was introduced for NFS, and made sense when we were guaranteed
    always to have a file pointer around when write errors occurred. Since
    then, we have (for various reasons) had to introduce the nfs_open_context in
    order to track the file read/write state, and it made sense to move our
    f_error tracking there too.

    Signed-off-by: Christoph Lameter
    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch allows block device drivers to convert their ioctl functions to
    unlocked_ioctl() like character devices and other subsystems. All
    functions that were called with the BKL held before are still used that
    way, but I would not be surprised if it could be removed from the ioctl
    functions in drivers/block/ioctl.c themselves.

    As a side note, I found that compat_blkdev_ioctl() acquires the BKL as
    well, which looks like a bug. I have checked that every user of
    disk->fops->compat_ioctl() in the current git tree gets the BKL itself, so
    it could easily be removed from compat_blkdev_ioctl().

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Based on analysis and a patch from Russ Weight

    There is a race condition that can occur if an inode is allocated and then
    released (using iput) during the ->fill_super functions. The race
    condition is between kswapd and mount.

    For most filesystems this can only happen in an error path when kswapd is
    running concurrently. For isofs, however, the error can occur in a more
    common code path (which is how the bug was found).

    The logic here is "we want final iput() to free inode *now* instead of
    letting it sit in cache if fs is going down or had not quite come up". The
    problem is with kswapd seeing such inodes in the middle of being killed and
    happily taking over.

    The clean solution would be to tell kswapd to leave those inodes alone and
    let our final iput deal with them. I.e. add a new flag
    (I_FORCED_FREEING), set it before write_inode_now() there and make
    prune_icache() leave those alone.
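
    The proposed rule can be illustrated with a single-threaded userspace
    model. Only the flag name I_FORCED_FREEING comes from the text above;
    the bit value, the struct and the function names are hypothetical, and
    the real race involves kswapd running concurrently, which this model
    only approximates by calling the prune step between the two halves of
    the final iput.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_I_FORCED_FREEING 0x1u   /* hypothetical bit value */

struct cached_inode {
    unsigned long ino;
    unsigned int state;
    int in_use;    /* nonzero while someone holds a reference */
    int freed;     /* nonzero once the inode has been released */
};

/* Model of the reclaim scan: free unused inodes, but leave alone any
 * inode that a final iput has already claimed. Returns the number freed. */
size_t model_prune(struct cached_inode *list, size_t n)
{
    size_t freed = 0;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (list[i].freed || list[i].in_use)
            continue;
        if (list[i].state & MODEL_I_FORCED_FREEING)
            continue;                 /* the final iput owns this one */
        list[i].freed = 1;
        freed++;
    }
    return freed;
}

/* First half of the final iput: claim the inode before write-out starts. */
void model_begin_final_iput(struct cached_inode *inode)
{
    inode->in_use = 0;
    inode->state |= MODEL_I_FORCED_FREEING;
}

/* Second half: write-out is done, so the inode can be freed right away. */
void model_finish_final_iput(struct cached_inode *inode)
{
    inode->freed = 1;
}
```

    Running model_prune() between the two halves leaves the claimed inode
    untouched, so it is freed exactly once, by the path that claimed it.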

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Viro
     

23 Jun, 2005

1 commit


21 Jun, 2005

1 commit

  • Based on the discussion about spufs attributes, this is my suggestion
    for a more generic attribute file support that can be used by both
    debugfs and spufs.

    Simple attribute files behave similarly to sequential files from
    a kernel programmer's perspective in that a standard set of file
    operations is provided and only an open operation needs to
    be written that registers file-specific get() and set() functions.

    These operations are defined as

    void foo_set(void *data, u64 val); and
    u64 foo_get(void *data);

    where data is the inode->u.generic_ip pointer of the file and the
    operations just need to make sense of that pointer. The infrastructure
    makes sure this works correctly with concurrent access and partial
    read calls.

    A macro named DEFINE_SIMPLE_ATTRIBUTE is provided to further simplify
    using the attributes.
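
    The get()/set() contract can be modelled in userspace as follows. This
    is a sketch, not the kernel implementation: struct model_attr and the
    model_attr_read()/model_attr_write() helpers are hypothetical, and only
    the callback shapes are taken from the description above (with uint64_t
    standing in for u64).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace model of one attribute file; names are hypothetical. */
struct model_attr {
    uint64_t (*get)(void *data);
    void (*set)(void *data, uint64_t val);
    void *data;        /* stands in for the per-file private pointer */
    const char *fmt;   /* printf format for one value, e.g. "%llu\n" */
};

/* Model of a read: format the current value into buf.
 * Returns the length of the formatted text. */
int model_attr_read(const struct model_attr *attr, char *buf, size_t len)
{
    assert(attr != NULL && attr->get != NULL);
    return snprintf(buf, len, attr->fmt,
                    (unsigned long long)attr->get(attr->data));
}

/* Model of a write: parse a number from text and pass it to set().
 * Returns 0 on success, -1 if nothing could be parsed or set is missing. */
int model_attr_write(const struct model_attr *attr, const char *text)
{
    char *end;
    unsigned long long val;

    assert(attr != NULL && text != NULL);
    val = strtoull(text, &end, 0);
    if (end == text || attr->set == NULL)
        return -1;
    attr->set(attr->data, (uint64_t)val);
    return 0;
}

/* Example attribute backed by a plain counter. */
uint64_t counter_get(void *data)
{
    return *(uint64_t *)data;
}

void counter_set(void *data, uint64_t val)
{
    *(uint64_t *)data = val;
}
```

    The file-specific code is reduced to the two small callbacks at the
    end; everything about parsing and formatting lives in the shared
    helpers, which is the simplification the commit is after.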

    This patch already contains the changes for debugfs to use attributes
    for its internal file operations.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

06 May, 2005

1 commit


01 May, 2005

2 commits

  • Some KernelDoc descriptions are updated to match the current code.
    No code changes.

    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Waitz
     
  • I have recompiled the Linux kernel 2.6.11.5 documentation for me and our
    university students again. The documentation could be extended to cover
    more sources that carry structured comments in recent 2.6 kernels, and I
    have tried to proceed with that task. I have done this several times since
    2.6.0 and it gets boring to make the same changes again and again. The
    Linux kernel compiles after the changes for i386 and ARM targets. I have
    added references to some more files into the kernel-api book, and I have
    added some section names as well. So please check that the changes do not
    break anything and that the categories are not too skewed.

    I have changed kernel-doc to accept the "fastcall" and "asmlinkage" words
    reserved by kernel convention. Most of the other changes are modifications
    to the comments to make kernel-doc happy, so that it accepts some parameter
    descriptions and does not bail out on errors. Changed to @pid in the
    description, moved some #ifdefs before comments to correct the
    function-to-comment bindings, etc.

    You can see the result of the modified documentation build at
    http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz

    Some more sources are ready to be included in the kernel-doc generated
    documentation. Sources have been added to kernel-api for now. Some more
    section names were added, and probably some more chaos was introduced as a
    result of quick cleanup work.

    Signed-off-by: Pavel Pisa
    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Pisa
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds