15 Aug, 2008

1 commit

  • write_cache_pages() uses i_mapping->writeback_index to pick up where it
    left off the last time a given inode was found by pdflush or
    balance_dirty_pages (or anyone else who sets wbc->range_cyclic).

    alloc_inode() should set it to a sane value so that writeback doesn't
    start in the middle of a file. It is somewhat difficult to notice the bug
    since write_cache_pages will loop around to the start of the file and the
    elevator helps hide the resulting seeks.

    For whatever reason, Btrfs hits this often. Unpatched, untarring 30
    copies of the Linux kernel in series runs at 47MB/s on a single SATA
    drive. With this fix, it jumps to 62MB/s.

    Signed-off-by: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Mason
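A userspace sketch of the range_cyclic idea described above follows. It assumes a hypothetical toy_mapping whose pages are just an array of dirty flags. The scan starts at writeback_index, wraps around to index 0 once, and records where it stopped. If writeback_index starts out as leftover garbage rather than 0, the first pass begins mid-file, which is the seek pattern the fix avoids. toy_mapping, toy_mapping_init() and toy_write_cyclic() are illustrative names, not the kernel's write_cache_pages().

```c
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical stand-in for an address_space: one dirty flag per page. */
struct toy_mapping {
    int dirty[TOY_NPAGES];
    size_t writeback_index; /* where the previous cyclic pass stopped */
};

/* Start clean, with writeback beginning at the start of the file
 * (the guarantee the fix gives newly allocated inodes). */
void toy_mapping_init(struct toy_mapping *m)
{
    for (size_t i = 0; i < TOY_NPAGES; i++)
        m->dirty[i] = 0;
    m->writeback_index = 0;
}

/* Write back at most nr_to_write dirty pages, starting at writeback_index
 * and wrapping around once. The order in which pages were written goes
 * into order[]; the number of pages written is returned. */
size_t toy_write_cyclic(struct toy_mapping *m, size_t nr_to_write, size_t order[])
{
    size_t written = 0;
    size_t idx = m->writeback_index;

    for (size_t scanned = 0; scanned < TOY_NPAGES && written < nr_to_write; scanned++) {
        if (m->dirty[idx]) {
            m->dirty[idx] = 0; /* "write" the page */
            order[written++] = idx;
        }
        idx = (idx + 1) % TOY_NPAGES;
    }
    m->writeback_index = idx; /* resume here on the next pass */
    return written;
}
```

With pages 0, 1 and 5 dirty and a budget of 2, the first pass writes 0 and 1 and leaves writeback_index at 2. The next pass therefore picks up at page 5 instead of rescanning from the start.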
     

27 Jul, 2008

2 commits

  • The kmem cache passed to a constructor is only needed for constructors that
    are themselves multiplexers. Nobody uses this "feature", nor does anybody use
    the passed kmem cache in a non-trivial way, so pass only a pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is a flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
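A minimal userspace sketch of the constructor shape after this change follows. It assumes a hypothetical toy_cache that runs its constructor once on each freshly allocated object. The constructor now receives only the object pointer, so it cannot depend on the cache it belongs to. toy_cache, toy_cache_alloc() and toy_inode_ctor() are illustrative, not the slab allocator's API.

```c
#include <stdlib.h>
#include <string.h>

/* Constructor shape after this change: only the object pointer. */
typedef void (*toy_ctor_t)(void *obj);

/* Hypothetical minimal cache: a fixed object size plus an optional ctor. */
struct toy_cache {
    size_t size;
    toy_ctor_t ctor;
};

void *toy_cache_alloc(const struct toy_cache *c)
{
    void *obj = malloc(c->size);
    if (obj && c->ctor)
        c->ctor(obj); /* the cache itself is no longer passed */
    return obj;
}

/* Example object and constructor, in the style of an fs inode cache. */
struct toy_inode {
    int i_count;
    char name[16];
};

void toy_inode_ctor(void *obj)
{
    struct toy_inode *inode = obj;

    inode->i_count = 0;
    memset(inode->name, 0, sizeof(inode->name));
}
```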
     
  • mapping->tree_lock has no read lockers. Convert the lock from an rwlock
    to a spinlock.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 May, 2008

2 commits

  • Commit 33dcdac2df54e66c447ae03f58c95c7251aa5649 ("kill ->put_inode")
    removed the final use of i_op->put_inode, but left the now totally
    unused "op" variable in iput().

    Get rid of it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • And with that last patch to affs killing the last put_inode instance we
    can finally, after many years of transition kill this racy and awkward
    interface.

    (It's kinda funny that even the description in
    Documentation/filesystems/vfs.txt was entirely wrong..)

    Also remove a very misleading comment above the definition of
    struct super_operations.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

29 Apr, 2008

1 commit


19 Apr, 2008

2 commits


08 Feb, 2008

1 commit

  • Remove the old iget() call and the read_inode() superblock operation it uses
    as these are really obsolete, and the use of read_inode() does not produce
    proper error handling (no distinction between ENOMEM and EIO when marking an
    inode bad).

    Furthermore, this removes the temptation to use iget() to find an inode by
    number in a filesystem from code outside that filesystem.

    iget_locked() should be used instead. An earlier patch adds a new function,
    iget_failed(), which is to be called to mark an inode as bad, unlock it
    and release it should the get routine fail. Mark iget() and read_inode() as
    being obsolete and remove references to them from the documentation.

    Typically a filesystem will be modified such that the read_inode function
    becomes an internal iget function, for example the following:

    void thingyfs_read_inode(struct inode *inode)
    {
    ...
    }

    would be changed into something like:

    struct inode *thingyfs_iget(struct super_block *sb, unsigned long ino)
    {
        struct inode *inode;
        int ret;

        inode = iget_locked(sb, ino);
        if (!inode)
            return ERR_PTR(-ENOMEM);
        if (!(inode->i_state & I_NEW))
            return inode;

        ...
        unlock_new_inode(inode);
        return inode;
    error:
        iget_failed(inode);
        return ERR_PTR(ret);
    }

    and then thingyfs_iget() would be called rather than iget(), for example:

    ret = -EINVAL;
    inode = iget(sb, ino);
    if (!inode || is_bad_inode(inode))
        goto error;

    becomes:

    inode = thingyfs_iget(sb, ino);
    if (IS_ERR(inode)) {
        ret = PTR_ERR(inode);
        goto error;
    }

    Note that is_bad_inode() does not need to be called. The error returned by
    thingyfs_iget() should render it unnecessary.

    Signed-off-by: David Howells
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
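The iget_locked() pattern above relies on the kernel's error-pointer convention (ERR_PTR(), IS_ERR(), PTR_ERR()). Under that convention, a function returns either a valid object or a negative errno encoded in the pointer. The sketch below is a userspace recreation of the idea and assumes MAX_ERRNO is 4095, as in include/linux/err.h. err_ptr(), is_err(), ptr_err(), struct toy_inode and toy_iget() are hypothetical stand-ins, not kernel symbols.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* The top MAX_ERRNO addresses are never valid objects, so a negative
 * errno can be carried in a pointer value. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error) { return (void *)(intptr_t)error; }
static inline long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_inode {
    unsigned long i_ino;
};

static struct toy_inode toy_table[4] = { {1}, {2}, {3}, {4} };

/* Hypothetical thingyfs_iget()-style lookup: returns the inode or an
 * encoded error, never NULL, so callers need no is_bad_inode() check. */
struct toy_inode *toy_iget(unsigned long ino)
{
    if (ino == 0)
        return err_ptr(-EINVAL);
    for (size_t i = 0; i < 4; i++)
        if (toy_table[i].i_ino == ino)
            return &toy_table[i];
    return err_ptr(-ENOENT);
}
```

A caller tests is_err() once and, on failure, propagates ptr_err(). This mirrors the thingyfs_iget() call site shown above.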
     

29 Jan, 2008

2 commits

  • This patch adds 64-bit inode version support to ext4. The lower 32 bits
    are stored in the osd1.linux1.l_i_version field while the high 32 bits
    are stored in the i_version_hi field newly created in the ext4_inode.
    The high 32 bits are only maintained if the ext4_inode is large enough to
    hold i_version_hi. An i_version mount option has been added to enable the feature.

    Signed-off-by: Mingming Cao
    Signed-off-by: Andreas Dilger
    Signed-off-by: Kalpak Shah
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Jean Noel Cordenner

    Jean Noel Cordenner
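A sketch of how a 64-bit version counter splits across the two 32-bit on-disk fields named above follows. It assumes a hypothetical toy_disk_version struct, and a large_inode flag stands in for the "ext4_inode is large enough" check. The real layout lives in struct ext4_inode.

```c
#include <stdint.h>

/* Hypothetical on-disk pair: the low half goes in l_i_version, the high
 * half in i_version_hi (only present when the on-disk inode is large). */
struct toy_disk_version {
    uint32_t l_i_version;
    uint32_t i_version_hi;
};

void toy_version_store(struct toy_disk_version *d, uint64_t v, int large_inode)
{
    d->l_i_version = (uint32_t)v;              /* low 32 bits */
    if (large_inode)
        d->i_version_hi = (uint32_t)(v >> 32); /* high 32 bits */
}

uint64_t toy_version_load(const struct toy_disk_version *d, int large_inode)
{
    uint64_t v = d->l_i_version;

    if (large_inode)
        v |= (uint64_t)d->i_version_hi << 32;
    return v;
}
```

On a small inode, only the low half survives a round trip. That is why the feature depends on the inode size.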
     
  • The i_version field of the inode is changed to be a 64-bit counter that
    is set on every inode creation and that is incremented every time the
    inode data is modified (similarly to the "ctime" time-stamp).
    The aim is to fulfill an NFSv4 requirement from RFC 3530.
    This first part concerns the VFS: it converts the 32-bit i_version in
    the generic inode to 64 bits, adds a flag in the super block to check
    whether the feature is enabled, and increments i_version in the VFS.

    Signed-off-by: Mingming Cao
    Signed-off-by: Jean Noel Cordenner
    Signed-off-by: Kalpak Shah

    Jean Noel Cordenner
     

17 Oct, 2007

4 commits

  • I_LOCK was used for several unrelated purposes, which caused deadlock
    situations in certain filesystems as a side effect. One of the purposes
    now uses the new I_SYNC bit.

    Also document the various bits and change their order from historical to
    logical.

    [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
    Signed-off-by: Joern Engel
    Cc: Dave Kleikamp
    Cc: David Chinner
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joern Engel
     
  • Since the mempages parameters are not actually used, they should be removed.

    Now only files_init uses the mempages parameter:

    files_init(mempages);

    but I don't think the mempages-based adaptation in files_init is really
    useful; and if files_init is also changed to the prototype void (*func)(void),
    the wrapper vfs_caches_init would no longer need the mempages parameter either.

    Signed-off-by: Denis Cheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Cheng
     
  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A slight oversight tripped the lockdep debugging code: each lockdep
    class should have only a single init site.

    Rearrange the code to make this true.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

15 Oct, 2007

1 commit


14 Oct, 2007

1 commit

  • On Mon, 2007-09-24 at 22:13 -0400, Steven Rostedt wrote:
    > The circular lock seems to be this:
    >
    > #1:
    >
    > sys_mmap2: down_write(&mm->mmap_sem);
    > nfs_revalidate_mapping: mutex_lock(&inode->i_mutex);
    >
    >
    > #0:
    >
    > vfs_readdir: mutex_lock(&inode->i_mutex);
    > - during the readdir (filldir64), we take a user fault (missing page?)
    > and call do_page_fault -
    > do_page_fault: down_read(&mm->mmap_sem);
    >
    >
    > So it does indeed look like a circular locking. Now the question is, "is
    > this a bug?". Looking like the inode of #1 must be a file or something
    > else that you can mmap and the inode of #0 seems it must be a directory.
    > I would say "no".
    >
    > Now if you can readdir on a file or mmap a directory, then this could be
    > an issue.
    >
    > Otherwise, I'd love to see someone teach lockdep about this issue! ;-)

    Make a distinction between file and dir usage of i_mutex.
    The inode should be complete and unused at unlock_new_inode(), so re-init
    i_mutex there depending on the inode's type.

    Signed-off-by: Peter Zijlstra

    Peter Zijlstra
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

18 Jul, 2007

2 commits

  • I can never remember what the function to register to receive VM pressure
    is called. I have to trace down from __alloc_pages() to find it.

    It's called "set_shrinker()", and it needs Your Help.

    1) Don't hide struct shrinker. It contains no magic.
    2) Don't allocate "struct shrinker". It's not helpful.
    3) Call them "register_shrinker" and "unregister_shrinker".
    4) Call the function "shrink" not "shrinker".
    5) Reduce the 17 lines of waffly comments to 13, but document it properly.

    Signed-off-by: Rusty Russell
    Cc: David Chinner
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
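A userspace sketch of the interface shape argued for above follows. The caller owns a plain, non-hidden shrinker struct and registers it, and the memory-pressure path calls each registered ->shrink(). toy_shrinker, toy_register_shrinker(), toy_unregister_shrinker() and toy_shrink_all() are hypothetical. The real struct shrinker has more fields, and its callback signature has changed across kernel versions.

```c
#include <stddef.h>

/* Hypothetical, simplified shrinker: the caller embeds it; the registry
 * allocates nothing. */
struct toy_shrinker {
    /* Free up to nr_to_scan objects; return how many remain cached. */
    int (*shrink)(int nr_to_scan);
    struct toy_shrinker *next;
};

static struct toy_shrinker *toy_shrinker_list;

void toy_register_shrinker(struct toy_shrinker *s)
{
    s->next = toy_shrinker_list;
    toy_shrinker_list = s;
}

void toy_unregister_shrinker(struct toy_shrinker *s)
{
    struct toy_shrinker **pp = &toy_shrinker_list;

    while (*pp && *pp != s)
        pp = &(*pp)->next;
    if (*pp)
        *pp = s->next;
}

/* Simulated memory pressure: ask every registered shrinker to scan. */
void toy_shrink_all(int nr_to_scan)
{
    for (struct toy_shrinker *s = toy_shrinker_list; s; s = s->next)
        s->shrink(nr_to_scan);
}

/* Example cache that a filesystem might expose to reclaim. */
int toy_cached_objects = 100;

int toy_cache_shrink(int nr_to_scan)
{
    int freed = nr_to_scan < toy_cached_objects ? nr_to_scan : toy_cached_objects;

    toy_cached_objects -= freed;
    return toy_cached_objects;
}
```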
     
  • It is often known at allocation time whether a page may be migrated or not.
    This patch adds a flag called __GFP_MOVABLE and a new mask called
    GFP_HIGH_MOVABLE. Allocations using __GFP_MOVABLE can be either migrated
    using the page migration mechanism or reclaimed by syncing with backing
    storage and discarding.

    An API function very similar to alloc_zeroed_user_highpage() is added for
    __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
    flags used by alloc_zeroed_user_highpage() are not changed because it would
    change the semantics of an existing API. After this patch is applied there
    are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
    be marked deprecated if this patch is merged.

    Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
    shmem.c to keep all flag modifications to inode->mapping in the
    shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
    Hugh Dickins.

    Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
    concept. Credit to Hugh Dickins for catching issues with shmem swap vector
    and ramfs allocations.

    [akpm@linux-foundation.org: build fix]
    [hugh@veritas.com: __GFP_ZERO cleanup]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 May, 2007

1 commit

  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 May, 2007

4 commits

  • The problems are:

    - on filesystems without permanent inode numbers, i_ino values can be larger
    than 32 bits, which can cause problems for some 32-bit userspace programs on
    a 64-bit kernel. We can't do anything for filesystems that have actual
    >32-bit inode numbers, but on filesystems that generate i_ino values on the
    fly, we should try to have them fit in 32 bits. We could trivially fix this
    by making the static counters in new_inode and iunique 32 bits, but...

    - many filesystems call new_inode and assume that the i_ino values they are
    given are unique. They are not guaranteed to be so, since the static
    counter can wrap. This problem is exacerbated by the fix for the first problem.

    - after allocating a new inode, some filesystems call iunique to try to get
    a unique i_ino value, but they don't actually add their inodes to the
    hashtable, and so they're still not guaranteed to be unique if that counter
    wraps.

    This patch set takes the simpler approach of using iunique and hashing
    the inodes afterward. Christoph H. previously mentioned that he thought that
    this approach may slow down lookups for filesystems that currently hash their
    inodes.

    The questions are:

    1) how much would this slow down lookups for these filesystems?
    2) is it enough to justify adding more infrastructure to avoid it?

    What might be best is to start with this approach and then only move to using
    IDR or some other scheme if these extra inodes in the hashtable prove to be
    problematic.

    I've done some cursory testing with this patch and the overhead of hashing and
    unhashing the inodes with pipefs is pretty low -- just a few seconds of system
    time added on to the creation and destruction of 10 million pipes (very
    similar to the overhead that the IDR approach would add).

    The hard thing to measure is what effect this has on other filesystems. I'm
    open to ways to try and gauge this.

    Again, I've only converted pipefs as an example. If this approach is
    acceptable then I'll start work on patches to convert other filesystems.

    With a pretty-much-worst-case microbenchmark provided by Eric Dumazet:

    hashing patch (pipebench):
    sys 1m15.329s
    sys 1m16.249s
    sys 1m17.169s

    unpatched (pipebench):
    sys 1m9.836s
    sys 1m12.541s
    sys 1m14.153s

    The ratio of the mean sys times is about 1.056 (76.25s vs. 72.18s), so a ~5-6% slowdown.

    This patch:

    When a 32-bit program that was not compiled with large file offsets does a
    stat and gets a st_ino value back that won't fit in the 32-bit field, glibc
    (correctly) generates an EOVERFLOW error. We can't do anything about fs's
    with larger permanent inode numbers, but when we generate them on the fly, we
    ought to try and have them fit within a 32-bit field.

    This patch takes the first step toward this by making the static counters in
    these two functions be 32 bits.

    [jlayton@redhat.com: mention that it's only the case for 32bit, non-LFS stat]
    Signed-off-by: Jeff Layton
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
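A sketch of the two halves of this change follows. The first half is the check a 32-bit, non-LFS stat path has to make, which fails with EOVERFLOW when st_ino does not fit. The second half is an inode-number counter kept at 32 bits, so generated numbers always pass that check. toy_stat32_ino() and toy_next_ino() are hypothetical helpers, not glibc or kernel functions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical check mirroring what a 32-bit, non-LFS stat() must do:
 * report EOVERFLOW when the inode number does not fit in 32 bits. */
int toy_stat32_ino(uint64_t ino, uint32_t *out)
{
    if (ino > UINT32_MAX)
        return -EOVERFLOW;
    *out = (uint32_t)ino;
    return 0;
}

/* Generated inode numbers come from a 32-bit counter, as the patch does
 * for the static counters in new_inode() and iunique(). The trade-off:
 * it wraps sooner, so uniqueness has to be checked separately. */
static uint32_t toy_last_ino;

uint64_t toy_next_ino(void)
{
    return ++toy_last_ino;
}
```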
     
  • There are many places in the kernel where constructions like

    foo = list_entry(head->next, struct foo_struct, list);

    are used.
    The code might look more descriptive and neat using the macro

    #define list_first_entry(head, type, member) \
        list_entry((head)->next, type, member)

    Here is the macro itself and examples of its usage in the generic code.
    If it turns out to be useful, I can prepare a set of patches to
    inject it into arch-specific code, drivers, networking, etc.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: Randy Dunlap
    Cc: Andi Kleen
    Cc: Zach Brown
    Cc: Davide Libenzi
    Cc: John McCutchan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Ram Pai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
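A self-contained userspace sketch of list_first_entry() in use follows. container_of() and list_entry() are simplified copies of the kernel helpers: plain casts with offsetof(), without the kernel's typeof type checking. init_list_head() and list_add_tail() are minimal stand-ins written for this example.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Simplified container_of: step back from the member to its container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* The helper proposed by the patch: the first element after head. */
#define list_first_entry(ptr, type, member) \
    list_entry((ptr)->next, type, member)

void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert new just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *new, struct list_head *head)
{
    new->prev = head->prev;
    new->next = head;
    head->prev->next = new;
    head->prev = new;
}

struct foo_struct {
    int value;
    struct list_head list;
};
```

Both spellings return the same element. list_first_entry(&head, struct foo_struct, list) simply states the intent directly.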
     
  • A while back, Christoph mentioned that he thought that iunique ought to be
    cleaned up to use a more conventional loop construct. This patch does that,
    turning the strange goto loop into a do/while.

    Signed-off-by: Jeff Layton
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeffrey Layton
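A userspace sketch of an iunique()-style allocator written as a do/while loop follows. It skips reserved numbers (at or below max_reserved) and retries until the candidate is not in use. A small array stands in for the inode hash; toy_ino_in_use(), toy_mark_in_use() and toy_iunique() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_INODES 16

/* Hypothetical stand-in for the inode hash: numbers currently in use. */
static unsigned long toy_in_use[TOY_MAX_INODES];
static size_t toy_nr_in_use;

bool toy_ino_in_use(unsigned long ino)
{
    for (size_t i = 0; i < toy_nr_in_use; i++)
        if (toy_in_use[i] == ino)
            return true;
    return false;
}

void toy_mark_in_use(unsigned long ino)
{
    if (toy_nr_in_use < TOY_MAX_INODES)
        toy_in_use[toy_nr_in_use++] = ino;
}

/* Skip reserved numbers and keep trying until the candidate is unused. */
unsigned long toy_iunique(unsigned long max_reserved)
{
    static unsigned int counter;
    unsigned long res;

    do {
        if (counter <= max_reserved)
            counter = max_reserved + 1;
        res = counter++;
    } while (toy_ino_in_use(res));

    return res;
}
```

The result is only unique until someone else picks the same number. As the earlier discussion of iunique points out, the caller also has to insert the new inode into the hash (toy_mark_in_use() here).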
     
  • inode->i_sb is always set, no need to check for it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

08 May, 2007

1 commit

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that manipulates
    the object.

    Also, the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL, it would have to be conditional on
    SLAB_DEBUG; otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand, and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would be pointless
    even without the removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Feb, 2007

2 commits

  • This patch is inspired by Arjan's "Patch series to mark struct
    file_operations and struct inode_operations const".

    Compile tested with gcc & sparse.

    Signed-off-by: Josef 'Jeff' Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • remove_dquot_ref can move to dquot.c instead of being in inode.c under
    #ifdef CONFIG_QUOTA. Also clean the resulting code up a tiny little bit by
    testing sb->dq_op earlier - it's constant over a filesystem's lifetime.

    Signed-off-by: Christoph Hellwig
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

12 Feb, 2007

3 commits

  • Convert all calls to invalidate_inode_pages() into open-coded calls to
    invalidate_mapping_pages().

    Leave the invalidate_inode_pages() wrapper in place for now, marked as
    deprecated.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When igrab() is calling __iget() on an inode it should check if
    clear_inode() has been called on the inode already. Otherwise there is a
    race window between clear_inode() and destroy_inode() where igrab() calls
    __iget(), which leads to already freed inodes on the inode lists.

    Signed-off-by: Vandana Rungta
    Signed-off-by: Jan Blunck
    Cc: Al Viro
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
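A userspace sketch of the check described above follows: take a reference only if the inode is not already on its way to being freed. The TOY_I_* bits and struct toy_inode are hypothetical. The exact set of kernel I_* bits tested is an assumption here, and the real igrab() also holds inode_lock, which this sketch omits.

```c
#include <stddef.h>

/* Hypothetical state bits; the kernel's I_* values differ. */
#define TOY_I_FREEING   (1u << 0)
#define TOY_I_CLEAR     (1u << 1) /* set once clear_inode() has run */
#define TOY_I_WILL_FREE (1u << 2)

struct toy_inode {
    unsigned int i_state;
    int i_count;
};

/* igrab()-style helper: refuse to resurrect a dying inode. */
struct toy_inode *toy_igrab(struct toy_inode *inode)
{
    if (inode->i_state & (TOY_I_FREEING | TOY_I_CLEAR | TOY_I_WILL_FREE))
        return NULL;
    inode->i_count++; /* stands in for __iget() */
    return inode;
}
```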
     
  • I added an IS_NOATIME(inode) macro definition in include/linux/fs.h, true if
    the inode's superblock is marked read-only or noatime.

    This new macro is then used in touch_atime() instead of separately testing
    MS_RDONLY and MS_NOATIME.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
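A userspace sketch of the macro's idea follows: one mask test on the superblock flags replaces two separate checks. TOY_MS_RDONLY, TOY_MS_NOATIME, TOY_IS_NOATIME and the two toy structs are hypothetical stand-ins, and their values are not the kernel's.

```c
/* Hypothetical flag values; the kernel's MS_* constants differ. */
#define TOY_MS_RDONLY  (1u << 0)
#define TOY_MS_NOATIME (1u << 1)

struct toy_super_block {
    unsigned long s_flags;
};

struct toy_inode {
    struct toy_super_block *i_sb;
};

/* True if the inode's superblock is read-only or mounted noatime. */
#define TOY_IS_NOATIME(inode) \
    (((inode)->i_sb->s_flags & (TOY_MS_RDONLY | TOY_MS_NOATIME)) != 0)
```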
     

14 Dec, 2006

2 commits

  • Add "relatime" (relative atime) support. Relative atime only updates the
    atime if the previous atime is older than the mtime or ctime. Like
    noatime, but useful for applications like mutt that need to know when a
    file has been read since it was last modified.

    A corresponding patch against mount(8) is available at
    http://userweb.kernel.org/~akpm/mount-relative-atime.txt

    Signed-off-by: Valerie Henson
    Cc: Mark Fasheh
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Karel Zak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valerie Henson
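A userspace sketch of the relatime rule as stated above follows: update atime only if the previous atime is older than mtime or ctime. "Older than" is read as a strict comparison here, which is an assumption; the kernel's exact comparison may treat equal timestamps differently. struct toy_inode and toy_relatime_need_update() are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/* Simplified timestamps: whole seconds only. */
struct toy_inode {
    time_t i_atime, i_mtime, i_ctime;
};

/* relatime: touch atime only when the last access predates the last
 * modification or status change. */
bool toy_relatime_need_update(const struct toy_inode *inode)
{
    return inode->i_atime < inode->i_mtime || inode->i_atime < inode->i_ctime;
}
```

After a write, the next read updates atime so that it passes mtime. Later reads leave it alone. A program like mutt can therefore still compare atime with mtime to tell whether a file was read after it was last modified.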
     
  • Simplify touch_atime() layout.

    Cc: Valerie Henson
    Cc: Mark Fasheh
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
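A userspace sketch of the new layout follows. struct file embeds a struct path, and two transition #defines let old code that names f_dentry and f_vfsmnt keep compiling. The structs are heavily simplified stand-ins (mnt_id and the const char *d_name are illustrative), and the #defines show one plausible form of the transition macros the message mentions.

```c
/* Simplified stand-ins; the real structures have many more fields. */
struct dentry {
    const char *d_name;
};

struct vfsmount {
    int mnt_id;
};

struct path {
    struct vfsmount *mnt;
    struct dentry *dentry;
};

struct file {
    struct path f_path; /* replaces separate f_dentry/f_vfsmnt pointers */
};

/* Transition macros: file->f_dentry expands to file->f_path.dentry. */
#define f_dentry f_path.dentry
#define f_vfsmnt f_path.mnt
```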
     

08 Dec, 2006

3 commits

  • Add a proper prototype for remove_inode_dquot_ref() in
    include/linux/quotaops.h

    Signed-off-by: Adrian Bunk
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
        quilt add $file
        sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
        mv /tmp/$$ $file
        quilt refresh
    done

    The script was run like this:

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Oct, 2006

1 commit

  • The splice_actor may be calling ->prepare_write() and ->commit_write(). We
    want i_mutex on the inode being written to before calling those so that we
    don't race i_size changes.

    The double locking behavior is done elsewhere in splice.c, and if we
    eventually want _nolock variants of generic_file_splice_write(), fs modules
    might have to replicate the nasty locking code. We introduce
    inode_double_lock() and inode_double_unlock() to consolidate the locking
    rules into one set of functions.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Jens Axboe

    Mark Fasheh
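A userspace sketch of the consolidated double-lock idea follows, using pthread mutexes. Two inodes are always locked in a fixed order (lower address first), so two threads locking the same pair cannot deadlock. NULL or identical arguments take a single lock. toy_inode_double_lock() and toy_inode_double_unlock() are hypothetical. The kernel versions use i_mutex with lockdep nesting annotations, which this sketch leaves out.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
    pthread_mutex_t i_mutex;
};

void toy_inode_double_lock(struct toy_inode *a, struct toy_inode *b)
{
    if (a == NULL || b == NULL || a == b) {
        if (a)
            pthread_mutex_lock(&a->i_mutex);
        else if (b)
            pthread_mutex_lock(&b->i_mutex);
        return;
    }
    /* Fixed ordering by address avoids ABBA deadlocks. */
    if ((uintptr_t)a > (uintptr_t)b) {
        struct toy_inode *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->i_mutex);
    pthread_mutex_lock(&b->i_mutex);
}

void toy_inode_double_unlock(struct toy_inode *a, struct toy_inode *b)
{
    if (a)
        pthread_mutex_unlock(&a->i_mutex);
    if (b && b != a)
        pthread_mutex_unlock(&b->i_mutex);
}
```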
     

11 Oct, 2006

1 commit


02 Oct, 2006

1 commit

  • Only touch the inode's i_mtime and i_ctime to make them equal to "now" if
    they aren't already (don't just update the timestamps unconditionally).
    Uninline the hash function to save 259 bytes.

    This tiny inode change, which may improve cache behaviour, also shaves off
    8 bytes from file_update_time() on i386.

    Included a tiny codestyle cleanup, too.

    Signed-off-by: Andreas Mohr
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
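A userspace sketch of the conditional update described above follows: timestamps are written, and the inode is marked dirty, only when they differ from "now". struct toy_time, struct toy_inode and toy_file_update_time() are hypothetical, and a plain dirty flag stands in for the kernel's inode-dirtying call.

```c
#include <stdbool.h>

/* Hypothetical seconds/nanoseconds timestamp. */
struct toy_time {
    long sec;
    long nsec;
};

struct toy_inode {
    struct toy_time i_mtime, i_ctime;
    bool dirty;
};

static bool toy_time_equal(const struct toy_time *a, const struct toy_time *b)
{
    return a->sec == b->sec && a->nsec == b->nsec;
}

/* Only write the timestamps (and dirty the inode) when they change. */
void toy_file_update_time(struct toy_inode *inode, struct toy_time now)
{
    bool changed = false;

    if (!toy_time_equal(&inode->i_mtime, &now)) {
        inode->i_mtime = now;
        changed = true;
    }
    if (!toy_time_equal(&inode->i_ctime, &now)) {
        inode->i_ctime = now;
        changed = true;
    }
    if (changed)
        inode->dirty = true; /* stands in for marking the inode dirty */
}
```

Repeated writes within the same timestamp granularity then leave the inode clean, which avoids needless dirtying.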