27 Mar, 2006

1 commit

  • While profiling an SMP platform with oprofile, I discovered that dentry
    lookups were slowed down because d_hash_mask, d_hash_shift and
    dentry_hashtable were in a cache line that contained inodes_stat. So each
    time inodes_stat is changed by a cpu, other cpus have to refill their
    cache line.

    This patch moves some variables to the __read_mostly section, in order to
    avoid false sharing. RCU dentry lookups can go full speed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
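
The false-sharing effect described in this commit can be illustrated in user space. The sketch below is not the kernel patch: the kernel moves variables into a dedicated __read_mostly section, whereas this example swaps in plain C11 alignment to keep the two groups on separate cache lines. The struct names, the field names and the 64-byte line size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size; real hardware varies (64 bytes is common). */
#define CACHE_LINE 64

/* Hypothetical stand-in for the lookup variables: set once, then only read. */
struct lookup_params {
    unsigned int hash_mask;
    unsigned int hash_shift;
    void *hashtable;
};

/* Hypothetical stand-in for the statistics: written frequently by every CPU. */
struct stat_counters {
    int nr_inodes;
    int nr_unused;
};

static_assert(sizeof(struct lookup_params) <= CACHE_LINE,
              "read-mostly group must fit in one cache line");

/* Problem layout: both groups are adjacent and can land in one cache line. */
struct shared_line {
    struct lookup_params params;
    struct stat_counters stats;
};

/* Separated layout: each group starts on its own cache line. */
struct split_lines {
    _Alignas(CACHE_LINE) struct lookup_params params;
    _Alignas(CACHE_LINE) struct stat_counters stats;
};

/* Index of the cache line that a byte offset falls into. */
static size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

/* Returns 1 if two member offsets start in the same cache line, else 0. */
static int shares_line(size_t off_a, size_t off_b)
{
    return line_of(off_a) == line_of(off_b);
}
```

Calling shares_line() with the offsetof() values of params and stats shows the difference between the two layouts: in struct shared_line a write to stats invalidates the line that readers of params depend on, while in struct split_lines it does not.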
     

26 Mar, 2006

1 commit


24 Mar, 2006

1 commit

  • Change the kmem_cache_create calls for certain slab caches to support cpuset
    memory spreading.

    See the previous patches, cpuset_mem_spread, for an explanation of cpuset
    memory spreading, and cpuset_mem_spread_slab_cache for the slab cache support
    for memory spreading.

    The slab caches marked for now are: dentry_cache, inode_cache, some xfs slab
    caches, and buffer_head. This list may change over time. In particular,
    other file system types that are used extensively on large NUMA systems may
    want to allow for spreading their directory and inode slab cache entries.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

23 Mar, 2006

2 commits

  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Cc: John McCutchan
    Signed-off-by: Andrew Morton
    Acked-by: Robert Love
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

02 Feb, 2006

1 commit


11 Jan, 2006

3 commits

  • Turn noatime and nodiratime into per-mount instead of per-sb flags.

    After all the preparations this is a rather trivial patch. The mount code
    needs to treat the two options as per-mount instead of per-superblock, and
    touch_atime needs to be changed to check the new MNT_ flags in addition to
    the MS_ flags that are kept for filesystems that are always
    noatime/nodiratime but not user settable anymore. Besides that core code,
    only nfs needed an update because it's leaving atime updates to the server
    and thus sets the S_NOATIME flag on every inode, but needs to know whether
    it's a real noatime mount for a getattr optimization.

    While we're at it I've killed the IS_NOATIME/IS_NODIRATIME macros that were
    only used by touch_atime.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
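
The decision that touch_atime has to make after this change can be sketched as a small user-space model. This is only an approximation under stated assumptions: the DEMO_ flag names and their bit values are invented for the example and are not the kernel's MS_ or MNT_ constants, and the per-inode S_NOATIME check and the read-only check are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits only; these are NOT the kernel's real values. */
enum {
    DEMO_MS_NOATIME     = 1 << 0,   /* superblock is always noatime      */
    DEMO_MS_NODIRATIME  = 1 << 1,   /* superblock is always nodiratime   */
    DEMO_MNT_NOATIME    = 1 << 2,   /* this mount was made with noatime  */
    DEMO_MNT_NODIRATIME = 1 << 3    /* this mount was made with nodiratime */
};

static_assert((DEMO_MS_NOATIME & DEMO_MNT_NOATIME) == 0,
              "superblock and mount flags use distinct bits in this model");

/*
 * Returns true if an access should update atime. The per-superblock flags
 * are checked first, then the per-mount flags, so either level can
 * suppress the update.
 */
static bool atime_update_wanted(unsigned int sb_flags,
                                unsigned int mnt_flags,
                                bool is_dir)
{
    if (sb_flags & DEMO_MS_NOATIME)
        return false;
    if ((sb_flags & DEMO_MS_NODIRATIME) && is_dir)
        return false;
    if (mnt_flags & DEMO_MNT_NOATIME)
        return false;
    if ((mnt_flags & DEMO_MNT_NODIRATIME) && is_dir)
        return false;
    return true;
}
```

With this shape, the same filesystem can be mounted twice and only the mount carrying DEMO_MNT_NOATIME suppresses atime updates, which is the point of moving the options from the superblock to the mount.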
     
  • All callers use touch_atime now which takes a vfsmount and allows us to
    implement per-mount noatime.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • To allow various options to work per-mount instead of per-sb we need a
    struct vfsmount when updating ctime and mtime. This preparation patch
    replaces the inode_update_time routine with a file_update_time routine so
    we can easily get at the vfsmount. (and the file makes more sense in this
    context anyway). Also get rid of the unused second argument - we always
    want to update the ctime when calling this routine.

    Signed-off-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

10 Jan, 2006

1 commit


09 Jan, 2006

1 commit

  • uninline a couple inode.c functions

    add/remove: 2/0 grow/shrink: 0/5 up/down: 256/-428 (-172)
    function            old     new   delta
    ifind                 -     136    +136
    ifind_fast            -     120    +120
    ilookup5_nowait     131      80     -51
    ilookup             158      71     -87
    ilookup5            171      80     -91
    iget_locked         190      95     -95
    iget5_locked        240     136    -104

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
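
The summary line of the size report can be checked against the per-function rows. The sketch below recomputes the up, down and net totals from the table above. The struct and function names are made up for the example, and a size of 0 stands for the "-" entries of functions that did not exist before the patch.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the size report; old_size 0 means the function is new. */
struct size_change {
    const char *name;
    int old_size;
    int new_size;
};

/* Rows copied from the report in the commit message. */
static const struct size_change changes[] = {
    { "ifind",             0, 136 },
    { "ifind_fast",        0, 120 },
    { "ilookup5_nowait", 131,  80 },
    { "ilookup",         158,  71 },
    { "ilookup5",        171,  80 },
    { "iget_locked",     190,  95 },
    { "iget5_locked",    240, 136 },
};

#define N_CHANGES (sizeof(changes) / sizeof(changes[0]))

/* Sum of all positive deltas (bytes added). */
static int sum_up(const struct size_change *c, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int delta = c[i].new_size - c[i].old_size;
        if (delta > 0)
            total += delta;
    }
    return total;
}

/* Sum of all negative deltas (bytes removed), returned as a negative number. */
static int sum_down(const struct size_change *c, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int delta = c[i].new_size - c[i].old_size;
        if (delta < 0)
            total += delta;
    }
    return total;
}
```

The two new out-of-line helpers add 256 bytes, the five callers shrink by 428 bytes in total, and the net change is 256 - 428 = -172 bytes, matching the header line.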
     

31 Oct, 2005

1 commit

  • list_move(&inode->i_list, &inode_in_use);
    } else {
    list_move(&inode->i_list, &inode_unused);
    + inodes_stat.nr_unused++;
    }
    }
    wake_up_inode(inode);

    Are you sure the above diff is correct? It was added somewhere between
    2.6.5 and 2.6.8. I think it's wrong.

    The only way I can imagine the i_count to be zero in the above path, is
    that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
    increase nr_unused. So I believe the above change is buggy and it will
    definitely overstate the number of unused inodes and it should be backed
    out.

    Note that __writeback_single_inode, before calling __sync_single_inode, can
    drop the spinlock and we can have both the dirty and locked bitflags clear
    here:

    spin_unlock(&inode_lock);
    __wait_on_inode(inode);
    iput(inode);
    XXXXXXX
    spin_lock(&inode_lock);
    }
    use inode again here

    a construct like the above makes zero sense from a reference counting
    standpoint.

    Either we don't ever use the inode again after the iput, or the
    inode_lock should be taken _before_ executing the iput (i.e. a __iput
    would be required). Taking the inode_lock after iput means the iget was
    useless if we keep using the inode after the iput.

    So the only way it was safe for 2.6 to call __writeback_single_inode
    with i_count == 0 is if I_WILL_FREE is set (I_WILL_FREE will
    prevent the VM from freeing the inode at XXXXXXX).

    Potentially calling the above iput with I_WILL_FREE was also wrong
    because it would recurse in iput_final (the second mainline bug).

    The below (untested) patch fixes the nr_unused accounting, avoids recursing
    in iput when I_WILL_FREE is set and makes sure (with the BUG_ON) that we
    don't corrupt memory and that all holders that don't set I_WILL_FREE keep
    a reference on the inode!

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

28 Oct, 2005

1 commit

  • - ->releasepage() annotated (s/int/gfp_t), instances updated
    - missing gfp_t in fs/* added
    - fixed misannotation from the original sweep caught by bitwise checks:
      XFS used __nocast both for gfp_t and for flags used by XFS allocator.
      The latter left with unsigned int __nocast; we might want to add a
      different type for those but for now let's leave them alone. That,
      BTW, is a case when __nocast use had been actively confusing - it had
      been used in the same code for two different and similar types, with
      no way to catch misuses. Switch of gfp_t to bitwise had caught that
      immediately...

    One tricky bit is left alone to be dealt with later - mapping->flags is
    a mix of gfp_t and error indications. Left alone for now.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

10 Sep, 2005

1 commit

  • Allow file systems supporting ->delete_inode() to call
    truncate_inode_pages() on their own. OCFS2 wants this so it can query the
    cluster before making a final decision on whether to wipe an inode from
    disk or not. In some corner cases an inode marked on the local node via
    voting may not actually get orphaned. A good example is node death before
    the transaction moving the inode to the orphan dir commits to the journal.
    Without this patch, the truncate_inode_pages() call in
    generic_delete_inode() would discard valid data for such inodes.

    During earlier discussion in the 2.6.13 merge plan thread, Christoph
    Hellwig indicated that other file systems might also find this useful.

    IMHO, the best solution would be to just allow ->drop_inode() to do the
    cluster query but it seems that would require a substantial reworking of
    that section of the code. Assuming it is safe to call write_inode_now() in
    ocfs2_delete_inode() for those inodes which won't actually get wiped, this
    solution should get us by for now.

    Trivial testing of this patch (and a related OCFS2 update) has shown this
    to avoid the corruption I'm seeing.

    Signed-off-by: Mark Fasheh
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

08 Sep, 2005

1 commit


14 Jul, 2005

1 commit

  • Something has changed in the core kernel such that we now get concurrent
    inode write outs, one e.g. via pdflush and one via sys_sync or whatever.
    This causes a nasty deadlock in ntfs. The only clean solution
    unfortunately requires a minor vfs api extension.

    First the deadlock analysis:

    Prerequisite knowledge: NTFS has a file $MFT (inode 0) loaded at mount
    time. The NTFS driver uses the page cache for storing the file contents as
    usual. More interestingly this file contains the table of on-disk inodes
    as a sequence of MFT_RECORDs. Thus the NTFS driver accesses the on-disk
    inodes by accessing the MFT_RECORDs in the page cache pages of the loaded
    inode $MFT.

    The situation: VFS inode X on a mounted ntfs volume is dirty. For the same
    inode X, the ntfs_inode is dirty and thus so is the corresponding on-disk
    inode, which, as explained above, is in a dirty PAGE_CACHE_PAGE belonging
    to the table of inodes ($MFT, inode 0).

    What happens:

    Process 1: sys_sync()/umount()/whatever... calls __sync_single_inode() for
    $MFT -> do_writepages() -> write_page for the dirty page containing the
    on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
    clears PageUptodate() on the page to prevent anyone else getting hold of it
    whilst it does the write out (this is necessary as the on-disk inode needs
    "fixups" applied before the write to disk which are removed again after the
    write and PageUptodate is then set again). It then analyses the page
    looking for dirty on-disk inodes and when it finds one it calls
    ntfs_may_write_mft_record() to see if it is safe to write this on-disk
    inode. This then calls ilookup5() to check if the corresponding VFS inode
    is in the icache. This in turn calls ifind() which waits on the inode lock
    via wait_on_inode whilst holding the global inode_lock.

    Process 2: pdflush results in a call to __sync_single_inode for the same
    VFS inode X on the ntfs volume. This locks the inode (I_LOCK) then calls
    write_inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
    the page (in page cache of table of inodes $MFT, inode 0) containing the
    on-disk inode. This page has PageUptodate() clear because of Process 1
    (see above) so read_cache_page() blocks when it tries to take the page lock
    for the page so it can call ntfs_readpage().

    Thus Process 1 is holding the page lock on the page containing the on-disk
    inode X and it is waiting on the inode X to be unlocked in ifind() so it
    can write the page out and then unlock the page.

    And Process 2 is holding the inode lock on inode X and is waiting for the
    page to be unlocked so it can call ntfs_readpage() or discover that
    Process 1 set PageUptodate() again and use the page.

    Thus we have a deadlock due to ifind() waiting on the inode lock.

    The only sensible solution: NTFS does not care whether the VFS inode is
    locked or not when it calls ilookup5() (it doesn't use the VFS inode at
    all, it just uses it to find the corresponding ntfs_inode which is of
    course attached to the VFS inode (both are one single struct); and it uses
    the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
    hence we want a modified ilookup5_nowait() which is the same as ilookup5()
    but it does not wait on the inode lock.

    Without such functionality I would have to keep my own ntfs_inode cache in
    the NTFS driver just so I can find ntfs_inodes independent of their VFS
    inodes which would be slow, memory and cpu cycle wasting, and incredibly
    stupid given the icache already exists in the VFS.

    Below is a patch that does the ilookup5_nowait() implementation in
    fs/inode.c and exports it.

    ilookup5_nowait.diff:

    Introduce ilookup5_nowait() which is basically the same as ilookup5() but
    it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
    done in ifind()).

    This is needed to avoid a nasty deadlock in NTFS.

    Signed-off-by: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
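
The difference between the waiting and the non-waiting lookup can be modelled with a toy cache. This is a sketch under heavy assumptions, not the fs/inode.c code: the real ilookup5() matches inodes through a caller-supplied test callback, while this model keys entries by a plain number, and the kernel's sleep in wait_on_inode() is represented by a LOOKUP_WOULD_BLOCK result instead of actually blocking. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cache entry; "locked" plays the role of I_LOCK being set. */
struct toy_inode {
    unsigned long ino;
    bool locked;
};

enum lookup_result {
    LOOKUP_MISS,         /* inode is not in the cache                     */
    LOOKUP_FOUND,        /* inode returned to the caller                  */
    LOOKUP_WOULD_BLOCK   /* a waiting lookup would sleep on the lock here */
};

/* Small sample cache: inode 42 is locked, as if a write-out were running. */
static const struct toy_inode sample_cache[] = {
    { 1,  false },
    { 42, true  },
};

/*
 * wait == true models the ilookup5() behaviour: a locked inode makes the
 * caller wait. wait == false models the ilookup5_nowait() behaviour: the
 * inode is returned whether or not it is locked.
 */
static enum lookup_result toy_lookup(const struct toy_inode *cache, size_t n,
                                     unsigned long ino, bool wait)
{
    assert(cache != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (cache[i].ino != ino)
            continue;
        if (wait && cache[i].locked)
            return LOOKUP_WOULD_BLOCK;
        return LOOKUP_FOUND;
    }
    return LOOKUP_MISS;
}
```

In the deadlock scenario, Process 1 holds the page lock and performs the lookup for inode X while Process 2 holds the inode lock. With wait set, the lookup ends in LOOKUP_WOULD_BLOCK and neither side can make progress. With wait cleared, Process 1 gets the inode, finishes the write and releases the page.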
     

13 Jul, 2005

3 commits

  • inotify is intended to correct the deficiencies of dnotify, particularly
    its inability to scale and its terrible user interface:

    * dnotify requires the opening of one fd per each directory
    that you intend to watch. This quickly results in too many
    open files and pins removable media, preventing unmount.
    * dnotify is directory-based. You only learn about changes to
    directories. Sure, a change to a file in a directory affects
    the directory, but you are then forced to keep a cache of
    stat structures.
    * dnotify's interface to user-space is awful. Signals?

    inotify provides a more usable, simple, powerful solution to file change
    notification:

    * inotify's interface is a system call that returns a fd, not SIGIO.
    You get a single fd, which is select()-able.
    * inotify has an event that says "the filesystem that the item
    you were watching is on was unmounted."
    * inotify can watch directories or files.

    Inotify is currently used by Beagle (a desktop search infrastructure),
    Gamin (a FAM replacement), and other projects.

    See Documentation/filesystems/inotify.txt.

    Signed-off-by: Robert Love
    Cc: John McCutchan
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Love
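
A minimal user-space sketch of the interface described above: one file descriptor, one watch, one read(). It assumes a Linux system whose C library provides <sys/inotify.h> with inotify_init() and inotify_add_watch(). The helper name and the temporary directory template are made up for the example, and error handling is reduced to returning 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Watches a fresh temporary directory, creates a file inside it, and returns
 * the mask of the first event read from the inotify fd (0 on any failure).
 */
static uint32_t first_event_after_create(void)
{
    char dir[] = "/tmp/inotify-demo-XXXXXX";
    char path[sizeof(dir) + 8];
    char buf[4096];
    uint32_t mask = 0;

    static_assert(sizeof(buf) >= sizeof(struct inotify_event),
                  "buffer must hold at least one event header");

    if (mkdtemp(dir) == NULL)
        return 0;

    int fd = inotify_init();            /* a single fd for all watches */
    if (fd < 0) {
        rmdir(dir);
        return 0;
    }

    /* Watch the directory itself; only creations are of interest here. */
    if (inotify_add_watch(fd, dir, IN_CREATE) >= 0) {
        snprintf(path, sizeof(path), "%s/file", dir);
        int file = open(path, O_CREAT | O_WRONLY, 0600);
        if (file >= 0) {
            close(file);
            /* The event is already queued, so this read does not block. */
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len >= (ssize_t)sizeof(struct inotify_event)) {
                struct inotify_event ev;
                memcpy(&ev, buf, sizeof(ev));
                mask = ev.mask;
            }
            unlink(path);
        }
    }

    close(fd);
    rmdir(dir);
    return mask;
}
```

Because the descriptor is an ordinary fd, a longer-running program could hand it to select() or poll() alongside its other descriptors instead of calling read() directly.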
     
  • Bug symptoms
    ~~~~~~~~~~~~
    For the same inode VFS calls read_inode() twice and doesn't call
    clear_inode() between the two read_inode() invocations.

    Bug description
    ~~~~~~~~~~~~~~~
    Suppose we have an inode which has zero reference count but is still in
    the inode cache. Suppose kswapd invokes shrink_icache_memory() to free
    some RAM. In prune_icache() inodes are removed from i_hash.
    prune_icache() is then going to call clear_inode(), but drops the
    inode_lock spinlock before this. If at this moment another task calls
    iget() for an inode which was just removed from i_hash by
    prune_icache(), then iget() invokes read_inode() for this inode, because
    it is *already removed* from i_hash.

    The end result is: we call iget(#N) then iput(#N); inode #N has zero
    i_count now and is in the inode cache; kswapd starts. kswapd removes the
    inode #N from i_hash and is preempted; we call iget(#N) again;
    read_inode() is invoked as the result; but we expect clear_inode()
    before.

    Fix
    ~~~~~~~
    To fix the bug I remove inodes from i_hash later, when clear_inode() is
    actually called. I remove them from i_hash under spinlock protection.
    Since the i_state is set to I_FREEING, it is safe to do this. The others
    will sleep waiting for the inode state change.

    I also postpone removing inodes from i_sb_list. It is not compulsory to
    do so but I do it for readability reasons. Inodes are added/removed to
    the lists together everywhere in the code and there is no point in
    changing this rule. This is harmless because the only user of i_sb_list
    which somehow may interfere with me (invalidate_list()) is excluded by
    the iprune_sem mutex.

    The same race is possible in invalidate_list() so I do the same for it.

    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem B. Bityuckiy
     
  • This patch fixes queer behavior in __wait_on_freeing_inode().

    If I_LOCK was not set it called yield(), effectively busy waiting for the
    removal of the inode from the hash. This change was introduced within
    "[PATCH] eliminate inode waitqueue hashtable" Changeset 1.1938.166.16 last
    October by wli.

    The solution is to restore the old behavior, of unconditionally waiting on
    the waitqueue. It doesn't matter if I_LOCK is not set initially, the task
    will go to sleep, and wake up when wake_up_inode() is called from
    generic_delete_inode() after removing the inode from the hash chain.

    Comment is also updated to better reflect current behavior.

    This condition is very hard to trigger normally (simultaneous clear_inode()
    with iget()) so probably only heavy stress testing can reveal any change of
    behavior.

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

08 Jul, 2005

1 commit

  • OCFS2 wants to mark an inode which has been orphaned by another node so
    that during final iput it takes the correct path through the VFS and can
    pass through the OCFS2 delete_inode callback. Since i_nlink can get out of
    date with other nodes, the best way I see to accomplish this is by clearing
    i_nlink on those inodes at drop_inode time. Other than this small amount
    of work, nothing different needs to happen, so I think it would be cleanest
    to be able to just call generic_drop_inode at the end of the OCFS2
    drop_inode callback.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
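
The path selection this commit relies on can be sketched as a toy model. It assumes, as in the VFS of that era as I understand it, that the generic drop routine takes the delete path when i_nlink is zero and the forget path otherwise. The struct, the orphaned_remotely field and every function name below are hypothetical and only stand in for the OCFS2 callback described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: only the fields this example needs. */
struct demo_inode {
    unsigned int i_nlink;       /* local link count, may be stale on a cluster */
    bool orphaned_remotely;     /* hypothetical: another node orphaned it      */
};

enum final_path {
    PATH_FORGET,    /* inode may stay cached or be written back */
    PATH_DELETE     /* filesystem's delete_inode callback runs  */
};

/* Model of the generic choice made at final iput. */
static enum final_path demo_generic_drop(const struct demo_inode *inode)
{
    return inode->i_nlink == 0 ? PATH_DELETE : PATH_FORGET;
}

/*
 * Model of a cluster filesystem's drop_inode: clear the stale link count
 * for remotely orphaned inodes, then defer to the generic routine.
 */
static enum final_path demo_cluster_drop(struct demo_inode *inode)
{
    if (inode->orphaned_remotely)
        inode->i_nlink = 0;
    return demo_generic_drop(inode);
}

/* Convenience wrapper: build an inode and report which path it takes. */
static enum final_path final_path_for(unsigned int nlink, bool orphaned)
{
    struct demo_inode inode = { nlink, orphaned };
    return demo_cluster_drop(&inode);
}
```

An inode with a stale link count of 1 that another node has orphaned ends up on the delete path, while an ordinary inode with the same link count is merely forgotten.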
     

24 Jun, 2005

1 commit

  • Based on analysis and a patch from Russ Weight

    There is a race condition that can occur if an inode is allocated and then
    released (using iput) during the ->fill_super functions. The race
    condition is between kswapd and mount.

    For most filesystems this can only happen in an error path when kswapd is
    running concurrently. For isofs, however, the error can occur in a more
    common code path (which is how the bug was found).

    The logic here is "we want final iput() to free inode *now* instead of
    letting it sit in cache if fs is going down or had not quite come up". The
    problem is with kswapd seeing such inodes in the middle of being killed and
    happily taking over.

    The clean solution would be to tell kswapd to leave those inodes alone and
    let our final iput deal with them. I.e. add a new flag
    (I_FORCED_FREEING), set it before write_inode_now() there and make
    prune_icache() leave those alone.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Viro
     

06 May, 2005

2 commits


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds