05 Mar, 2010

2 commits

  • Currently various places in the VFS call vfs_dq_init directly. This means
    we tie the quota code into the VFS. Get rid of that and make the
    filesystem responsible for the initialization. For most metadata operations
    this is a straightforward move into the methods, but for truncate and
    open it's a bit more complicated.

    For truncate we currently only call vfs_dq_init for the sys_truncate case
    because open already takes care of it for ftruncate and open(O_TRUNC) - the
    new code causes an additional vfs_dq_init for those which is harmless.

    For open the initialization is moved from do_filp_open into the open method,
    which means it happens slightly earlier now, and only for regular files.
    The latter is fine because we don't need to initialize it for operations
    on special files, and we already do it as part of the namespace operations
    for directories.

    Add a dquot_file_open helper that filesystems that support generic quotas
    can use to fill in ->open.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Currently clear_inode calls vfs_dq_drop directly. This means
    we tie the quota code into the VFS. Get rid of that and make the
    filesystem responsible for the drop inside the ->clear_inode
    superblock operation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
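
The open-path change in the first 05 Mar, 2010 commit can be pictured with a small user-space model. This is only a sketch: `model_inode`, `model_file_ops`, `model_dq_init`, `model_generic_open`, `model_dquot_file_open` and `model_vfs_open` are hypothetical stand-ins, not the kernel's types or functions. It shows the quota call living inside the filesystem's open method, so the VFS-side dispatcher never mentions quota.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct model_inode {
    int is_regular;   /* 1 for a regular file, 0 for a special file */
    int quota_ready;  /* set once quota state has been initialized */
};

struct model_file_ops {
    int (*open)(struct model_inode *inode);
};

/* Stand-in for the quota initialization; harmless to call repeatedly. */
static void model_dq_init(struct model_inode *inode)
{
    inode->quota_ready = 1;
}

/* Stand-in for the generic open checks; always succeeds in this model. */
static int model_generic_open(struct model_inode *inode)
{
    assert(inode != NULL);
    return 0;
}

/* Helper a quota-aware filesystem plugs into ->open for regular files. */
static int model_dquot_file_open(struct model_inode *inode)
{
    int err = model_generic_open(inode);

    if (!err)
        model_dq_init(inode);
    return err;
}

static const struct model_file_ops model_regular_fops = {
    .open = model_dquot_file_open,
};

static const struct model_file_ops model_special_fops = {
    .open = model_generic_open,
};

/* The "VFS" side only dispatches through ->open and knows nothing about quota. */
int model_vfs_open(struct model_inode *inode)
{
    const struct model_file_ops *ops =
        inode->is_regular ? &model_regular_fops : &model_special_fops;

    return ops->open(inode);
}
```

Opening a regular file through `model_vfs_open` leaves `quota_ready` set, while a special file is opened without any quota work, which mirrors the "only for regular files" remark in the commit message.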
     

18 Dec, 2009

1 commit

  • After I_SYNC was split from I_LOCK the leftover is always used together with
    I_NEW and thus superfluous.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

25 Oct, 2009

1 commit


24 Sep, 2009

4 commits

  • Do a similar optimization as earlier for touch_atime. Getting the lock in
    mnt_want_write is relatively costly, so try all avenues to avoid it first.

    This patch is careful to still only update inode fields inside the lock
    region.

    This didn't show up in benchmarks, but it's easy enough to do.

    [akpm@linux-foundation.org: fix typo in comment]
    [hugh.dickins@tiscali.co.uk: fix inverted test of mnt_want_write_file()]
    Signed-off-by: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Valerie Aurora
    Cc: Al Viro
    Cc: Dave Hansen
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Andi Kleen
     
  • Some benchmark testing shows touch_atime to be high up in profile logs for
    IO intensive workloads. Most likely that's due to the lock in
    mnt_want_write(). Unfortunately touch_atime first takes the lock, and
    then does all the other tests that could avoid atime updates (like noatime
    or relatime).

    Do it the other way round -- first try to avoid the update and only then
    if that didn't succeed take the lock. That works because none of the
    atime avoidance tests rely on locking.

    This also eliminates a goto.

    Signed-off-by: Andi Kleen
    Cc: Christoph Hellwig
    Reviewed-by: Valerie Aurora
    Cc: Al Viro
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Andi Kleen
     
  • Hugetlbfs needs to do special things instead of truncate_inode_pages().
    Currently, it copies generic_forget_inode() except for the
    truncate_inode_pages() call, which is asking for trouble (the code there
    isn't trivial). So create a separate function generic_detach_inode()
    which does all the list magic done in generic_forget_inode() and call
    it from hugetlbfs_forget_inode().

    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Jan Kara
     
  • Add device-id and inode number for better debugging. This was suggested
    by Andreas in one of the threads
    http://article.gmane.org/gmane.comp.file-systems.ext4/12062 .

    "If anyone has a chance, fixing this error message to be not-useless would
    be good... Including the device name and the inode number would help
    track down the source of the problem."

    Signed-off-by: Manish Katiyar
    Cc: Andreas Dilger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Manish Katiyar
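
The reordering described in the touch_atime commit of 24 Sep, 2009 (do the cheap avoidance tests first, take the mount write lock last) can be sketched in user space as follows. All names here (`model_mount`, `model_inode`, `model_want_write`, `model_touch_atime`) are hypothetical, and the counter merely stands in for the cost of the real lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the mount and inode state involved. */
struct model_mount {
    int noatime;           /* mount-wide "never update atime" flag */
    int write_lock_count;  /* how often the costly write lock was taken */
};

struct model_inode {
    int noatime;  /* per-inode "never update atime" flag */
    long atime;   /* last access time, in seconds */
};

/* Stand-in for the expensive "want write access to this mount" step. */
static int model_want_write(struct model_mount *mnt)
{
    mnt->write_lock_count++;
    return 0;
}

static void model_drop_write(struct model_mount *mnt)
{
    (void)mnt;
}

void model_touch_atime(struct model_mount *mnt, struct model_inode *inode,
                       long now)
{
    assert(mnt != NULL && inode != NULL);

    /* Cheap tests first: none of them needs the lock. */
    if (inode->noatime)
        return;
    if (mnt->noatime)
        return;
    if (inode->atime == now)
        return;

    /* Only now pay for the lock, and modify the inode only while holding it. */
    if (model_want_write(mnt) != 0)
        return;
    inode->atime = now;
    model_drop_write(mnt);
}
```

On a noatime mount the counter stays at zero, which is the saving the commit is after; the inode field is still written only inside the locked region.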
     

23 Sep, 2009

1 commit

  • We have had a report of bad memory allocation latency during DVD-RAM (UDF)
    writing. This is causing the user's desktop session to become unusable.

    Jan tracked the cause of this down to UDF inode reclaim blocking:

    gnome-screens D ffff810006d1d598 0 20686 1
    ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
    ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
    ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
    Call Trace:
    [] io_schedule+0x63/0xa5
    [] sync_buffer+0x3b/0x3f
    [] __wait_on_bit+0x47/0x79
    [] out_of_line_wait_on_bit+0x6a/0x77
    [] __wait_on_buffer+0x1f/0x21
    [] __bread+0x70/0x86
    [] :udf:udf_tread+0x38/0x3a
    [] :udf:udf_update_inode+0x4d/0x68c
    [] :udf:udf_write_inode+0x1d/0x2b
    [] __writeback_single_inode+0x1c0/0x394
    [] write_inode_now+0x7d/0xc4
    [] :udf:udf_clear_inode+0x3d/0x53
    [] clear_inode+0xc2/0x11b
    [] dispose_list+0x5b/0x102
    [] shrink_icache_memory+0x1dd/0x213
    [] shrink_slab+0xe3/0x158
    [] try_to_free_pages+0x177/0x232
    [] __alloc_pages+0x1fa/0x392
    [] alloc_page_vma+0x176/0x189
    [] __do_fault+0x10c/0x417
    [] handle_mm_fault+0x466/0x940
    [] do_page_fault+0x676/0xabf

    This blocks with iprune_mutex held, which then blocks other reclaimers:

    X D ffff81009d47c400 0 17285 14831
    ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
    ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
    ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
    Call Trace:
    [] __mutex_lock_slowpath+0x72/0xa9
    [] mutex_lock+0x1e/0x22
    [] shrink_icache_memory+0x49/0x213
    [] shrink_slab+0xe3/0x158
    [] try_to_free_pages+0x177/0x232
    [] __alloc_pages+0x1fa/0x392
    [] alloc_pages_current+0xd1/0xd6
    [] __get_free_pages+0xe/0x4d
    [] __pollwait+0x5e/0xdf
    [] :nvidia:nv_kern_poll+0x2e/0x73
    [] do_select+0x308/0x506
    [] core_sys_select+0x1a6/0x254
    [] sys_select+0xb5/0x157

    Now I think the main problem is having the filesystem block (and do IO) in
    inode reclaim. The problem is that this doesn't get accounted well and
    penalizes a random allocator with a big latency spike caused by work
    generated from elsewhere.

    I think the best idea would be to avoid this. By design if possible, or
    by deferring the hard work to an asynchronous context. If the latter,
    then the fs would probably want to throttle creation of new work with
    queue size of the deferred work, but let's not get into those details.

    Anyway, the other obvious thing we looked at is the iprune_mutex which is
    causing the cascading blocking. We could turn this into an rwsem to
    improve concurrency. It is unreasonable to totally ban all potentially
    slow or blocking operations in inode reclaim, so I think this is a cheap
    way to get a small improvement.

    This doesn't solve the whole problem of course. The process doing inode
    reclaim will still take the latency hit, and concurrent processes may end
    up contending on filesystem locks. So fs developers should keep these
    problems in mind.

    Signed-off-by: Nick Piggin
    Cc: Jan Kara
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

22 Sep, 2009

2 commits


16 Sep, 2009

1 commit

  • It has been unused since it was introduced in:

    commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac
    Author: Andrew Morton
    Date: Fri May 21 00:46:17 2004 -0700

    [PATCH] block device layer: separate backing_dev_info infrastructure

    So let's just kill it.

    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Aug, 2009

2 commits

  • When we want to tear down an inode that lost the add to the cache race
    in XFS we must not call into ->destroy_inode because that would delete
    the inode that won the race from the inode cache radix tree.

    This patch provides the __destroy_inode helper needed to fix this,
    the actual fix will be in the next patch. As XFS was the only reason
    destroy_inode was exported we shift the export to the new __destroy_inode.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen

    Christoph Hellwig
     
  • Currently inode_init_always calls into ->destroy_inode if the additional
    initialization fails. That's not only counter-intuitive because
    inode_init_always did not allocate the inode structure, but in case of
    XFS it's actively harmful as ->destroy_inode might delete the inode from
    a radix tree it has never been added to. This in turn might end up
    deleting the inode for the same inum that has been instantiated by
    another process and cause lots of subtle problems.

    Also in the case of re-initializing a reclaimable inode in XFS it would
    free an inode we still want to keep alive.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen

    Christoph Hellwig
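
The split between generic teardown and the filesystem's own ->destroy_inode, as described in the two 08 Aug, 2009 commits, can be illustrated with a toy model. Everything below is hypothetical: `model_cache` is a single-slot stand-in for the XFS radix tree, and `model_destroy_inode_generic` plays the role the commit gives to __destroy_inode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the single slot replaces a per-inum radix tree. */
struct model_inode {
    unsigned long ino;
    int generic_torn_down;
};

struct model_cache {
    struct model_inode *slot;
};

/* Generic part of the teardown: touches only the inode itself. */
void model_destroy_inode_generic(struct model_inode *inode)
{
    assert(inode != NULL);
    inode->generic_torn_down = 1;
}

/* Filesystem part: drops whatever the cache holds for this inode number. */
static void model_fs_destroy_inode(struct model_cache *cache,
                                   struct model_inode *inode)
{
    if (cache->slot && cache->slot->ino == inode->ino)
        cache->slot = NULL;
}

/* Full teardown: generic part followed by the filesystem callback. */
void model_destroy_inode(struct model_cache *cache, struct model_inode *inode)
{
    model_destroy_inode_generic(inode);
    model_fs_destroy_inode(cache, inode);
}
```

If a loser of the cache insertion race (same inode number, never inserted) goes through `model_destroy_inode`, the winner's entry disappears from the cache; going through `model_destroy_inode_generic` alone leaves the winner in place.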
     

25 Jun, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
    switch xfs to generic acl caching helpers
    helpers for acl caching + switch to those
    switch shmem to inode->i_acl
    switch reiserfs to inode->i_acl
    switch reiserfs to usual conventions for caching ACLs
    reiserfs: minimal fix for ACL caching
    switch nilfs2 to inode->i_acl
    switch btrfs to inode->i_acl
    switch jffs2 to inode->i_acl
    switch jfs to inode->i_acl
    switch ext4 to inode->i_acl
    switch ext3 to inode->i_acl
    switch ext2 to inode->i_acl
    add caching of ACLs in struct inode
    fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls
    cleanup __writeback_single_inode
    ... and the same for vfsmount id/mount group id
    Make allocation of anon devices cheaper
    update Documentation/filesystems/Locking
    devpts: remove module-related code
    ...

    Linus Torvalds
     

24 Jun, 2009

1 commit


23 Jun, 2009

1 commit

  • Some filesystems need to set lockdep map for i_mutex differently for
    different directories. For example OCFS2 has system directories (for
    orphan inode tracking and for gathering all system files like journal
    or quota files into a single place) which have different locking
    rules than standard directories. For a filesystem, setting the
    lockdep map is naturally done when the inode is read, but we have to
    modify unlock_new_inode() not to overwrite the lockdep map the filesystem
    has set.

    Acked-by: peterz@infradead.org
    CC: mingo@redhat.com
    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     

13 Jun, 2009

1 commit


12 Jun, 2009

3 commits

  • This patch speeds up lmbench lat_mmap test by about another 2% after the
    first patch.

    Before:
    avg = 462.286
    std = 5.46106

    After:
    avg = 453.12
    std = 9.58257

    (50 runs of each, stddev gives a reasonable confidence)

    It does this by introducing mnt_clone_write, which avoids some heavyweight
    operations of mnt_want_write if called on a vfsmount which we know already
    has a write count; and mnt_want_write_file, which can call mnt_clone_write
    if the file is open for write.

    After these two patches, mnt_want_write and mnt_drop_write go from 7% on
    the profile down to 1.3% (including mnt_clone_write).

    [AV: mnt_want_write_file() should take file alone and derive mnt from it;
    not only all callers have that form, but that's the only mnt about which
    we know that it's already held for write if file is opened for write]

    Cc: Dave Hansen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • When an fs is unmounted with an fsnotify mark entry attached to one of its
    inodes we need to destroy that mark entry and we also (like inotify) send
    an unmount event.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
     
  • This patch creates a way for fsnotify groups to attach marks to inodes.
    These marks have little meaning to the generic fsnotify infrastructure
    and thus their meaning should be interpreted by the group that attached
    them to the inode's list.

    dnotify and inotify will make use of these markings to indicate which
    inodes are of interest to their respective groups. But this implementation
    has the useful property that in the future other listeners could actually
    use the marks for the exact opposite reason, aka to indicate which inodes
    it had NO interest in.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
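
The mnt_want_write_file idea from the lat_mmap commit of 12 Jun, 2009 is that a file already open for write guarantees the mount has a write count, so a cheaper increment is enough. A minimal user-space sketch, with hypothetical `model_*` names and a counter standing in for the heavyweight path:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the mount and the open file. */
struct model_mount {
    int writers;          /* current write count on the mount */
    int slow_path_calls;  /* how often the heavyweight path ran */
};

struct model_file {
    struct model_mount *mnt;
    int open_for_write;
};

/* Heavyweight path: would normally do the costly checks and locking. */
int model_mnt_want_write(struct model_mount *mnt)
{
    mnt->slow_path_calls++;
    mnt->writers++;
    return 0;
}

/* Cheap path: valid only when a write count is already held. */
int model_mnt_clone_write(struct model_mount *mnt)
{
    assert(mnt->writers > 0);
    mnt->writers++;
    return 0;
}

/* Takes the file alone and derives the mount from it. */
int model_mnt_want_write_file(struct model_file *file)
{
    assert(file != NULL && file->mnt != NULL);
    if (file->open_for_write)
        return model_mnt_clone_write(file->mnt);
    return model_mnt_want_write(file->mnt);
}

void model_mnt_drop_write(struct model_mount *mnt)
{
    mnt->writers--;
}
```

Taking only the file as the argument follows the [AV] note in the commit message: the file's own mount is the only one known to be held for write.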
     

07 Jun, 2009

1 commit

  • CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
    until the system runs out of memory. Nowhere is calling ima_inode_free()
    a.k.a. ima_iint_delete(). Fix that by calling it from destroy_inode().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Jun, 2009

1 commit

  • OK, that's probably the easiest way to do that, as much as I don't like it...
    Since iget() et al. will not accept I_FREEING (will wait for it to go away
    and restart), and since we'd better have serialization between new/free
    on fs data structures anyway, we can afford simply skipping I_FREEING
    et al. in insert_inode_locked().

    We do that from new_inode, so it won't race with free_inode in any interesting
    ways and it won't race with iget (of any origin; nfsd or in case of fs
    corruption a lookup) since both still will wait for I_LOCK.

    Reviewed-by: "Theodore Ts'o"
    Acked-by: Jan Kara
    Tested-by: David Watson
    Signed-off-by: Al Viro

    Al Viro
     

09 May, 2009

1 commit


15 Apr, 2009

1 commit

  • There are lots of sequences like this, especially in splice code:

    if (pipe->inode)
        mutex_lock(&pipe->inode->i_mutex);
    /* do something */
    if (pipe->inode)
        mutex_unlock(&pipe->inode->i_mutex);

    so introduce helpers which do the conditional locking and unlocking.
    Also replace the inode_double_lock() call with a pipe_double_lock()
    helper to avoid spreading the use of this functionality beyond the
    pipe code.

    This patch is just a cleanup, and should cause no behavioral changes.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
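
The conditional locking helpers introduced by the 15 Apr, 2009 commit might look roughly like the user-space sketch below. The `model_*` names and the flag-based mutex are hypothetical, and ordering the double lock by address is an assumption made here for illustration; the commit message does not say how pipe_double_lock() orders its locks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: a flag plays the role of the mutex. */
struct model_mutex { int held; };
struct model_inode { struct model_mutex i_mutex; };
struct model_pipe  { struct model_inode *inode; };  /* inode may be NULL */

static void model_mutex_lock(struct model_mutex *m)
{
    assert(!m->held);
    m->held = 1;
}

static void model_mutex_unlock(struct model_mutex *m)
{
    assert(m->held);
    m->held = 0;
}

/* Lock only if the pipe has an inode attached. */
void model_pipe_lock(struct model_pipe *pipe)
{
    if (pipe->inode)
        model_mutex_lock(&pipe->inode->i_mutex);
}

void model_pipe_unlock(struct model_pipe *pipe)
{
    if (pipe->inode)
        model_mutex_unlock(&pipe->inode->i_mutex);
}

/* Assumed ordering rule: always lock the lower address first. */
void model_pipe_double_lock(struct model_pipe *a, struct model_pipe *b)
{
    assert(a != b);
    if ((uintptr_t)a < (uintptr_t)b) {
        model_pipe_lock(a);
        model_pipe_lock(b);
    } else {
        model_pipe_lock(b);
        model_pipe_lock(a);
    }
}
```

Callers then write `model_pipe_lock(pipe); /* do something */ model_pipe_unlock(pipe);` without repeating the NULL test at every site.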
     

28 Mar, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits)
    fs: avoid I_NEW inodes
    Merge code for single and multiple-instance mounts
    Remove get_init_pts_sb()
    Move common mknod_ptmx() calls into caller
    Parse mount options just once and copy them to super block
    Unroll essentials of do_remount_sb() into devpts
    vfs: simple_set_mnt() should return void
    fs: move bdev code out of buffer.c
    constify dentry_operations: rest
    constify dentry_operations: configfs
    constify dentry_operations: sysfs
    constify dentry_operations: JFS
    constify dentry_operations: OCFS2
    constify dentry_operations: GFS2
    constify dentry_operations: FAT
    constify dentry_operations: FUSE
    constify dentry_operations: procfs
    constify dentry_operations: ecryptfs
    constify dentry_operations: CIFS
    constify dentry_operations: AFS
    ...

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
    ext2: Zero our b_size in ext2_quota_read()
    trivial: fix typos/grammar errors in fs/Kconfig
    quota: Coding style fixes
    quota: Remove superfluous inlines
    quota: Remove uppercase aliases for quota functions.
    nfsd: Use lowercase names of quota functions
    jfs: Use lowercase names of quota functions
    udf: Use lowercase names of quota functions
    ufs: Use lowercase names of quota functions
    reiserfs: Use lowercase names of quota functions
    ext4: Use lowercase names of quota functions
    ext3: Use lowercase names of quota functions
    ext2: Use lowercase names of quota functions
    ramfs: Remove quota call
    vfs: Use lowercase names of quota functions
    quota: Remove dqbuf_t and other cleanups
    quota: Remove NODQUOT macro
    quota: Make global quota locks cacheline aligned
    quota: Move quota files into separate directory
    ext4: quota reservation for delayed allocation
    ...

    Linus Torvalds
     
  • To be on the safe side, it should be less fragile to exclude I_NEW inodes
    from inode list scans by default (unless there is an important reason to
    have them).

    Normally they will get excluded (eg. by zero refcount or writecount etc),
    however it is a bit fragile for list walkers to know exactly what parts of
    the inode state is set up and valid to test when in I_NEW. So along these
    lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
    checks with them too -- this shouldn't be a problem should it?)

    Signed-off-by: Nick Piggin
    Acked-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Nick Piggin
     

27 Mar, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
    SELinux: inode_doinit_with_dentry drop no dentry printk
    SELinux: new permission between tty audit and audit socket
    SELinux: open perm for sock files
    smack: fixes for unlabeled host support
    keys: make procfiles per-user-namespace
    keys: skip keys from another user namespace
    keys: consider user namespace in key_permission
    keys: distinguish per-uid keys in different namespaces
    integrity: ima iint radix_tree_lookup locking fix
    TOMOYO: Do not call tomoyo_realpath_init unless registered.
    integrity: ima scatterlist bug fix
    smack: fix lots of kernel-doc notation
    TOMOYO: Don't create securityfs entries unless registered.
    TOMOYO: Fix exception policy read failure.
    SELinux: convert the avc cache hash list to an hlist
    SELinux: code readability with avc_cache
    SELinux: remove unused av.decided field
    SELinux: more careful use of avd in avc_has_perm_noaudit
    SELinux: remove the unused ae.used
    SELinux: check seqno when updating an avc_node
    ...

    Linus Torvalds
     
  • Allow atime to be updated once per day even with relatime. This lets
    utilities like tmpreaper (which delete files based on last access time)
    continue working, making relatime a plausible default for distributions.

    Signed-off-by: Matthew Garrett
    Reviewed-by: Matthew Wilcox
    Acked-by: Valerie Aurora Henson
    Acked-by: Alan Cox
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Matthew Garrett
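
The relatime rule with the once-per-day refresh can be written as a small decision function. This is a sketch under assumptions: `model_times` and `model_relatime_need_update` are hypothetical names, times are plain seconds, and the first two tests encode the usual relatime behaviour (update when atime is not newer than mtime or ctime), which the commit message itself does not spell out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DAY_SECONDS (24L * 60 * 60)

/* Hypothetical stand-in for the three inode timestamps, in seconds. */
struct model_times {
    long atime;
    long mtime;
    long ctime;
};

/* Returns 1 if atime should be written, 0 if the update can be skipped. */
int model_relatime_need_update(const struct model_times *t, long now)
{
    assert(t != NULL);

    /* Assumed classic relatime rule: file changed since last recorded access. */
    if (t->mtime >= t->atime)
        return 1;
    if (t->ctime >= t->atime)
        return 1;

    /* Rule from this commit: refresh once the recorded atime is a day old. */
    if (now - t->atime >= MODEL_DAY_SECONDS)
        return 1;

    return 0;
}
```

With the last test in place, a tool that deletes files by access time sees an atime that is at most about a day stale, even for files that are read often but never modified.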
     

26 Mar, 2009

1 commit


24 Mar, 2009

1 commit


13 Mar, 2009

1 commit

  • There was a report of a data corruption
    http://lkml.org/lkml/2008/11/14/121. There is a script included to
    reproduce the problem.

    During testing, I encountered a number of strange things with ext3, so I
    tried ext2 to attempt to reduce complexity of the problem. I found that
    fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
    cleared, even though instrumentation showed that unlock_new_inode had
    already been called for that inode. This points to memory scribble, or
    a synchronisation problem.

    i_state of I_NEW inodes is not protected by inode_lock because other
    processes are not supposed to touch them until I_LOCK (and I_NEW) is
    cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
    i_state revealed that generic_sync_sb_inodes is picking up new inodes from
    the inode lists and passing them to __writeback_single_inode without
    waiting for I_NEW. Subsequently modifying i_state causes corruption. In
    my case it would look like this:

    CPU0                            CPU1
    unlock_new_inode()              __sync_single_inode()
     reg <- inode->i_state
     reg -> reg & ~(I_LOCK|I_NEW)   reg <- inode->i_state
     reg -> inode->i_state          reg -> reg | I_SYNC
                                    reg -> inode->i_state

    Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.

    The fix is to skip over I_NEW inodes rather than wait for them:
    inodes concurrently being created are not subject to data integrity
    operations, and should not significantly contribute to dirty memory
    either.

    After this change, I'm unable to reproduce any of the added warnings or
    hangs after ~1hour of running. Previously, the new warnings would start
    immediately and hang would happen in under 5 minutes.

    I'm also testing on ext3 now, and so far no problems there either. I
    don't know whether this fixes the problem reported above, but it fixes a
    real problem for me.

    Cc: "Jorge Boncompte [DTI2]"
    Reported-by: Adrian Hunter
    Cc: Jan Kara
    Cc:
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
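
The lost update in the CPU0/CPU1 interleaving of the 13 Mar, 2009 commit can be replayed deterministically on a single thread. The sketch below uses hypothetical `MODEL_*` bit values (the real kernel constants may differ) and two local variables in place of the CPU registers; the second function models the fix of skipping I_NEW inodes.

```c
#include <assert.h>

/* Illustrative bit values only; not the kernel's definitions. */
#define MODEL_I_LOCK 0x1u
#define MODEL_I_NEW  0x2u
#define MODEL_I_SYNC 0x4u

/* Replays the interleaving from the commit message, step by step. */
unsigned int model_racy_interleaving(unsigned int i_state)
{
    unsigned int reg0, reg1;

    reg0 = i_state;                         /* CPU0: load i_state */
    reg0 &= ~(MODEL_I_LOCK | MODEL_I_NEW);  /* CPU0: clear I_LOCK|I_NEW */
    reg1 = i_state;                         /* CPU1: load the stale value */
    i_state = reg0;                         /* CPU0: store */
    reg1 |= MODEL_I_SYNC;                   /* CPU1: set I_SYNC */
    i_state = reg1;                         /* CPU1: store, undoing CPU0 */

    return i_state;
}

/* Model of the fix: writeback leaves I_NEW inodes untouched. */
unsigned int model_writeback_skip_new(unsigned int i_state)
{
    if (i_state & MODEL_I_NEW)
        return i_state;
    return i_state | MODEL_I_SYNC;
}
```

Starting from `MODEL_I_LOCK | MODEL_I_NEW`, the racy replay ends with all three bits set, so a waiter on I_LOCK never wakes; with the skip in place the writeback side never modifies the state of a new inode at all.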
     

06 Feb, 2009

2 commits

  • Conflicts:
    fs/namei.c

    Manually merged per:

    diff --cc fs/namei.c
    index 734f2b5,bbc15c2..0000000
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
    nd->flags |= LOOKUP_CONTINUE;
    err = exec_permission_lite(inode);
    if (err == -EAGAIN)
    - err = vfs_permission(nd, MAY_EXEC);
    + err = inode_permission(nd->path.dentry->d_inode,
    + MAY_EXEC);
    + if (!err)
    + err = ima_path_check(&nd->path, MAY_EXEC);
    if (err)
    break;

    @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
    flag &= ~O_TRUNC;
    }

    - error = vfs_permission(nd, acc_mode);
    + error = inode_permission(inode, acc_mode);
    if (error)
    return error;
    +
    - error = ima_path_check(&nd->path,
    ++ error = ima_path_check(path,
    + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
    + if (error)
    + return error;
    /*
    * An append-only file must be opened in append mode for writing.
    */

    Signed-off-by: James Morris

    James Morris
     
  • This patch replaces the generic integrity hooks, for which IMA registered
    itself, with IMA integrity hooks in the appropriate places directly
    in the fs directory.

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Mimi Zohar
     

10 Jan, 2009

1 commit


08 Jan, 2009

1 commit


07 Jan, 2009

2 commits

  • Fix kernel-doc notation:

    Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'sb'
    Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'inode'
    Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'sb'
    Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'inode'

    Signed-off-by: Randy Dunlap
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
    that harder to track down: remove it, and its out-of-work brothers
    GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.

    Since we're making that improvement to hotremove_migrate_alloc(), I think
    we can now also remove one of the "o"s from its comment.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Jan, 2009

1 commit