28 Mar, 2011

1 commit


25 Mar, 2011

8 commits

  • Merge get_new_inode/get_new_inode_fast into iget5_locked/iget_locked
    as those were the only callers. Remove the internal ifind/ifind_fast
    helpers - ifind_fast only had a single caller, and ifind had two
    callers wanting it to do different things. Also clean up the comments
    in this area to focus on information important to a developer trying
    to use it, instead of overloading them with implementation details.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • All that remains of the inode_lock is protecting the inode hash list
    manipulation and traversals. Rename the inode_lock to
    inode_hash_lock to reflect its actual function.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the per-sb inode list with a new global lock
    inode_sb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now that inode state changes are protected by the inode->i_lock and
    the inode LRU manipulations by the inode_lru_lock, we can remove the
    inode_lock from prune_icache and the initial part of iput_final().

    Instead of using the inode_lock to protect the inode during
    iput_final, use the inode->i_lock instead. This protects the inode
    against new references being taken while we change the inode state
    to I_FREEING, as well as preventing prune_icache from grabbing the
    inode while we are manipulating it. Hence we no longer need the
    inode_lock in iput_final prior to setting I_FREEING on the inode.

    For prune_icache, we no longer need the inode_lock to protect the
    LRU list, and the inodes themselves are protected against freeing
    races by the inode->i_lock. Hence we can lift the inode_lock from
    prune_icache as well.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Introduce the inode_lru_lock to protect the inode_lru list. This
    lock is nested inside the inode->i_lock to allow the inode to be
    added to the LRU list in iput_final without needing to deal with
    lock inversions. This keeps iput_final() clean and neat.

    Further, where marking the inode I_FREEING and removing it from the
    LRU, move the LRU list manipulation within the inode->i_lock to keep
    the list manipulation consistent with iput_final. This also means
    that most of the open coded LRU list removal + unused inode
    accounting can now use the inode_lru_list_del() wrapper, which
    cleans the code up further.

    However, this locking change means that the LRU traversal in
    prune_icache() inverts this lock ordering and needs to use trylock
    semantics on the inode->i_lock to avoid deadlocking. In these cases,
    if we fail to lock the inode we move it to the back of the LRU to
    prevent spinning on it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
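
The trylock-and-rotate behaviour described in this commit can be illustrated in userspace. This is a minimal sketch, not the kernel code: the `spinlock` type is a toy test-and-set lock built on C11 atomics, and `struct lru`, `inode_lru_add()` and `lru_reclaim_one()` are hypothetical names invented for the illustration. The normal path takes the inode lock first and the LRU lock second; the scan holds the LRU lock first, so it may only try the inode lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy test-and-set lock standing in for a kernel spinlock (assumption). */
typedef struct { atomic_flag held; } spinlock;

static void spin_lock(spinlock *l)    { while (atomic_flag_test_and_set(&l->held)) { } }
static bool spin_trylock(spinlock *l) { return !atomic_flag_test_and_set(&l->held); }
static void spin_unlock(spinlock *l)  { atomic_flag_clear(&l->held); }

struct inode {
    spinlock i_lock;
    struct inode *prev, *next;   /* LRU linkage */
};

struct lru {
    spinlock lock;               /* plays the role of inode_lru_lock */
    struct inode head;           /* sentinel; head.next is the LRU front */
};

static void inode_init(struct inode *i)
{
    atomic_flag_clear(&i->i_lock.held);
    i->next = i->prev = i;
}

static void lru_init(struct lru *l)
{
    atomic_flag_clear(&l->lock.held);
    inode_init(&l->head);
}

static void list_del(struct inode *i)
{
    i->prev->next = i->next;
    i->next->prev = i->prev;
    i->next = i->prev = i;
}

static void list_add_tail(struct inode *i, struct inode *head)
{
    i->prev = head->prev;
    i->next = head;
    head->prev->next = i;
    head->prev = i;
}

/* Normal ordering: inode lock first, LRU lock nested inside it. */
static void inode_lru_add(struct lru *l, struct inode *i)
{
    spin_lock(&i->i_lock);
    spin_lock(&l->lock);
    list_add_tail(i, &l->head);
    spin_unlock(&l->lock);
    spin_unlock(&i->i_lock);
}

/*
 * Inverted ordering: the scan already holds the LRU lock, so it only
 * tries the inode lock. On failure the inode is rotated to the back of
 * the LRU instead of being spun on. Returns the reclaimed inode or NULL.
 */
static struct inode *lru_reclaim_one(struct lru *l)
{
    struct inode *victim = NULL;

    spin_lock(&l->lock);
    if (l->head.next != &l->head) {
        struct inode *i = l->head.next;

        list_del(i);
        if (spin_trylock(&i->i_lock)) {
            spin_unlock(&i->i_lock);
            victim = i;                  /* now off the LRU */
        } else {
            list_add_tail(i, &l->head);  /* contended: move to the back */
        }
    }
    spin_unlock(&l->lock);
    return victim;
}
```

If one inode's lock is held elsewhere when the scan reaches it, the scan skips it and reclaims the next candidate on the following call rather than deadlocking.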
     
  • We have a couple of places that dispose of inodes. Factor the
    disposal into evict() to isolate this code and make it simpler to
    peel away the inode_lock from the code.

    While doing this, change the logic flow in iput_final() to separate
    the different cases that need to be handled to make the transitions
    the inode goes through more obvious.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
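
The requirement that the state check and the reference grab happen atomically can be sketched as follows. This is a userspace illustration under stated assumptions: the lock is a toy C11 test-and-set lock, the `I_FREEING` value is illustrative, and `grab_inode_if_live()` and `mark_inode_freeing()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative flag value; the kernel's real bit assignment may differ. */
#define I_FREEING (1u << 0)

struct inode {
    atomic_flag i_lock;      /* toy stand-in for the inode->i_lock spinlock */
    unsigned int i_state;
    unsigned int i_count;
};

static void inode_init(struct inode *inode)
{
    atomic_flag_clear(&inode->i_lock);
    inode->i_state = 0;
    inode->i_count = 0;
}

static void lock_inode(struct inode *inode)
{
    while (atomic_flag_test_and_set(&inode->i_lock)) { }
}

static void unlock_inode(struct inode *inode)
{
    atomic_flag_clear(&inode->i_lock);
}

/*
 * Check i_state and take the reference inside one critical section, so
 * no other thread can set I_FREEING between the check and the increment.
 */
static bool grab_inode_if_live(struct inode *inode)
{
    bool live;

    lock_inode(inode);
    live = !(inode->i_state & I_FREEING);
    if (live)
        inode->i_count++;
    unlock_inode(inode);
    return live;
}

/* Only an unreferenced inode may start being freed; same lock, same rule. */
static bool mark_inode_freeing(struct inode *inode)
{
    bool ok;

    lock_inode(inode);
    ok = (inode->i_count == 0);
    if (ok)
        inode->i_state |= I_FREEING;
    unlock_inode(inode);
    return ok;
}
```

Because both helpers serialize on the same per-inode lock, a list walker either gets its reference before the inode is marked I_FREEING or sees the flag and skips the inode.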
     

24 Mar, 2011

2 commits

  • And give it a kernel-doc comment.

    [akpm@linux-foundation.org: btrfs changed in linux-next]
    Signed-off-by: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Cheat for now and say all files belong to init_user_ns. Next step will be
    to let superblocks belong to a user_ns, and derive inode_userns(inode)
    from inode->i_sb->s_user_ns. Finally we'll introduce more flexible
    arrangements.

    Changelog:
    Feb 15: make is_owner_or_cap take const struct inode
    Feb 23: make is_owner_or_cap bool

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

21 Mar, 2011

1 commit


16 Mar, 2011

1 commit

  • iprune_sem is continuously giving us lockdep warnings because we do take it in
    read mode in the reclaim path, but we're also doing non-NOFS allocations under
    it taken in write mode.

    Taking a bit deeper look at it I think it's fixable quite trivially:

    - for invalidate_inodes we do not need iprune_sem at all. We have an active
    reference on the superblock, so the filesystem is not going away until it
    has finished.
    - for evict_inodes we do need it, to make sure prune_icache has done its
    work before we tear down the superblock. But there is no reason to
    hold it over the actual reclaim operation - it's enough to cycle through
    it after the actual reclaim to make sure we wait for any pending
    prune_icache to complete. We just have to remove the WARN_ON for
    otherwise busy inodes as they can actually happen now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     

24 Feb, 2011

2 commits

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
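
The effect of the new flag can be modelled in a few lines. This is a simplified sketch, not the kernel function: the inode structure, the `I_DIRTY` value and `invalidate_inodes_sketch()` are hypothetical, and real invalidation does much more than set a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value; the kernel's I_DIRTY is a mask of several bits. */
#define I_DIRTY (1u << 0)

struct inode {
    unsigned int state;
    unsigned int count;   /* active references */
    bool evicted;
};

/*
 * Walk the inodes and evict the ones that can go. Dirty inodes are only
 * discarded when kill_dirty is true (device gone); when the device merely
 * changed size they are skipped so their data can still be written back.
 * Returns the number of inodes left behind.
 */
static size_t invalidate_inodes_sketch(struct inode *inodes, size_t n,
                                       bool kill_dirty)
{
    size_t busy = 0;

    for (size_t i = 0; i < n; i++) {
        struct inode *inode = &inodes[i];

        if (inode->evicted)
            continue;
        if ((inode->state & I_DIRTY) && !kill_dirty) {
            busy++;           /* keep dirty data for later writeback */
            continue;
        }
        if (inode->count > 0) {
            busy++;           /* still referenced */
            continue;
        }
        inode->evicted = true;
    }
    return busy;
}
```

In this model the size-change path would pass `false` and the device-removal path would pass `true`, matching the two flush_disk cases described in the commit message.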
     
  • Michael Leun reported that running parallel opens on a fuse filesystem
    can trigger a "kernel BUG at mm/truncate.c:475"

    Gurudas Pai reported the same bug on NFS.

    The reason is, unmap_mapping_range() is not prepared for more than
    one concurrent invocation per inode. For example:

    thread1: going through a big range, stops in the middle of a vma and
    stores the restart address in vm_truncate_count.

    thread2: comes in with a small (e.g. single page) unmap request on
    the same vma, somewhere before restart_address, finds that the
    vma was already unmapped up to the restart address and happily
    returns without doing anything.

    Another scenario would be two big unmap requests, both having to
    restart the unmapping and each one setting vm_truncate_count to its
    own value. This could go on forever without any of them being able to
    finish.

    Truncate and hole punching already serialize with i_mutex. Other
    callers of unmap_mapping_range() do not, and it's difficult to get
    i_mutex protection for all callers. In particular ->d_revalidate(),
    which calls invalidate_inode_pages2_range() in fuse, may be called
    with or without i_mutex.

    This patch adds a new mutex to 'struct address_space' to prevent
    running multiple concurrent unmap_mapping_range() on the same mapping.

    [ We'll hopefully get rid of all this with the upcoming mm
    preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
    lockbreak" patch in particular. But that is for 2.6.39 ]

    Signed-off-by: Miklos Szeredi
    Reported-by: Michael Leun
    Reported-by: Gurudas Pai
    Tested-by: Gurudas Pai
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

07 Jan, 2011

4 commits

  • Pseudo filesystems that don't put inodes on RCU lists or make them
    reachable via rcu-walk dentries do not need to RCU free their inodes.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • percpu_counter library generates quite nasty code, so unless you need
    to dynamically allocate counters or take fast approximate value, a
    simple per cpu set of counters is much better.

    The percpu_counter can never be made to work as well, because it has an
    indirection from pointer to percpu memory, and it can't use direct
    this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
    code will always be worse.

    In the fastpath, it is the difference between this:

    incl %gs:nr_dentry # nr_dentry

    and this:

    movl percpu_counter_batch(%rip), %edx # percpu_counter_batch,
    movl $1, %esi #,
    movq $nr_dentry, %rdi #,
    call __percpu_counter_add # (plus I clobber registers)

    __percpu_counter_add:
    pushq %rbp #
    movq %rsp, %rbp #,
    subq $32, %rsp #,
    movq %rbx, -24(%rbp) #,
    movq %r12, -16(%rbp) #,
    movq %r13, -8(%rbp) #,
    movq %rdi, %rbx # fbc, fbc
    #APP
    # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
    movq %gs:kernel_stack,%rax #, pfo_ret__
    # 0 "" 2
    #NO_APP
    incl -8124(%rax) # .preempt_count
    movq 32(%rdi), %r12 # .counters, tcp_ptr__
    #APP
    # 78 "lib/percpu_counter.c" 1
    add %gs:this_cpu_off, %r12 # this_cpu_off, tcp_ptr__
    # 0 "" 2
    #NO_APP
    movslq (%r12),%r13 #* tcp_ptr__, tmp73
    movslq %edx,%rax # batch, batch
    addq %rsi, %r13 # amount, count
    cmpq %rax, %r13 # batch, count
    jge .L27 #,
    negl %edx # tmp76
    movslq %edx,%rdx # tmp76, tmp77
    cmpq %rdx, %r13 # tmp77, count
    jg .L28 #,
    .L27:
    movq %rbx, %rdi # fbc,
    call _raw_spin_lock #
    addq %r13, 8(%rbx) # count, .count
    movq %rbx, %rdi # fbc,
    movl $0, (%r12) #,* tcp_ptr__
    call _raw_spin_unlock #
    .L29:
    #APP
    # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
    movq %gs:kernel_stack,%rax #, pfo_ret__
    # 0 "" 2
    #NO_APP
    decl -8124(%rax) # .preempt_count
    movq -8136(%rax), %rax #, D.14625
    testb $8, %al #, D.14625
    jne .L32 #,
    .L31:
    movq -24(%rbp), %rbx #,
    movq -16(%rbp), %r12 #,
    movq -8(%rbp), %r13 #,
    leave
    ret
    .p2align 4,,10
    .p2align 3
    .L28:
    movl %r13d, (%r12) # count,*
    jmp .L29 #
    .L32:
    call preempt_schedule #
    .p2align 4,,6
    jmp .L31 #
    .size __percpu_counter_add, .-__percpu_counter_add
    .p2align 4,,15

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • The nr_unused counters count the number of objects on an LRU, and as such they
    are synchronized with LRU object insertion and removal and scanning, and
    protected under the LRU lock.

    Making it per-cpu does not actually get any concurrency improvements because of
    this lock, and summing the counter is much slower, and
    incrementing/decrementing it costs more code size and is slower too.

    These counters should stay per-LRU, which currently means global.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • IMA currently allocates an inode integrity structure for every inode in
    core. This structure is about 120 bytes long. Most files however
    (especially on a system which doesn't make use of IMA) will never need
    any of this space. The problem is that if IMA is enabled we need to
    know information about the number of readers and the number of writers
    for every inode on the box. At the moment we collect that information
    in the per inode iint structure and waste the rest of the space. This
    patch moves those counters into the struct inode so we can eventually
    stop allocating an IMA integrity structure except when absolutely
    needed.

    This patch does the minimum needed to move the location of the data.
    Further cleanups, especially the location of counter updates, may still
    be possible.

    Signed-off-by: Eric Paris
    Acked-by: Mimi Zohar
    Signed-off-by: Linus Torvalds

    Eric Paris
     

26 Oct, 2010

18 commits

  • Pull removal of fsnotify marks into generic_shutdown_super().
    Split umount-time work into a new function - evict_inodes().
    Make sure that invalidate_inodes() will be able to cope with
    I_FREEING once we change locking in iput().

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Despite the comment above it we can not safely drop the lock here.
    invalidate_list is called from many other places than just umount.
    Also switch to proper list macros now that we never drop the lock.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     
  • We must not call invalidate_inode_buffers in invalidate_list unless the
    inode can be reclaimed. If we remove the buffer association of a busy
    inode fsync won't find the buffers anymore. As invalidate_inode_buffers
    is called from various sources other than umount this actually does
    matter in practice.

    While at it change the loop to a more natural form and remove the
    WARN_ON for I_NEW, which we already tested a few lines above.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Instead of always assigning an increasing inode number in new_inode
    move the call to assign it into those callers that actually need it.
    For now the set of callers that need it is estimated conservatively, that is
    the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • new_inode() dirties a contended cache line to get increasing
    inode numbers. This limits performance on workloads that cause
    significant parallel inode allocation.

    Solve this problem by using a per_cpu variable fed by the shared
    last_ino in batches of 1024 allocations. This reduces contention on
    the shared last_ino, and gives the same spread of inode numbers as before
    (i.e. same wraparound after 2^32 allocations).

    Signed-off-by: Eric Dumazet
    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
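
The batching scheme can be approximated in userspace with a thread-local variable standing in for the per-CPU one. This is a sketch under that assumption: `next_ino_sketch()` and the variable names are hypothetical, and kernel per-CPU accessors are replaced by `_Thread_local`.

```c
#include <assert.h>
#include <stdatomic.h>

#define LAST_INO_BATCH 1024u   /* batch size named in the commit message */

/* Shared counter: touched once per batch instead of once per allocation. */
static atomic_uint shared_last_ino;

/* Thread-local stand-in for the per-CPU last_ino variable (assumption). */
static _Thread_local unsigned int local_last_ino;

static unsigned int next_ino_sketch(void)
{
    unsigned int res = local_last_ino;

    /* Local batch exhausted (or never started): reserve a fresh one. */
    if ((res & (LAST_INO_BATCH - 1)) == 0)
        res = atomic_fetch_add(&shared_last_ino, LAST_INO_BATCH);

    local_last_ino = ++res;
    return res;
}
```

With a single caller the numbers come out strictly sequential; with several callers each one works through its own block of 1024 and only contends on the shared counter when a block runs out. Unsigned arithmetic wraps, so the sequence restarts after 2^32 allocations as the commit message notes.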
     
  • Clones an existing reference to inode; caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
     
  • Split up inode_add_to_list/__inode_add_to_list. Locking for the two
    lists will be split soon so these helpers really don't buy us much
    anymore.

    The __ prefixes for the sb list helpers will go away soon, but until
    inode_lock is gone we'll need them to distinguish between the locked
    and unlocked variants.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that iunique is not abusing find_inode anymore we can move the i_ref
    increment back to where it belongs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
    Introduce a new iunique_lock to protect the iunique counters once inode_lock
    is removed.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Before replacing the inode hash locking with a more scalable
    mechanism, factor the removal of the inode from the hashes rather
    than open coding it in several places.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
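
The reclaim-time decision this commit describes can be reduced to a small function. This is an illustrative model only: the flag value, the `lru_action` enum and `lru_scan_inode()` are hypothetical, and the actual reclaim path handles more cases than shown here.

```c
#include <assert.h>

/* Illustrative flag value; the kernel's real bit assignment may differ. */
#define I_REFERENCED (1u << 0)

enum lru_action {
    LRU_REMOVE,    /* inode is in use: take it off the LRU */
    LRU_ROTATE,    /* recently touched: give it another pass */
    LRU_RECLAIM    /* unused and not recently touched: free it */
};

struct inode {
    unsigned int i_state;
    unsigned int i_count;
};

/*
 * All the work is deferred to the scan: lookups and releases only set a
 * flag or change the refcount, and never move the inode within the list.
 */
static enum lru_action lru_scan_inode(struct inode *inode)
{
    if (inode->i_count > 0)
        return LRU_REMOVE;

    if (inode->i_state & I_REFERENCED) {
        inode->i_state &= ~I_REFERENCED;   /* one second chance only */
        return LRU_ROTATE;
    }

    return LRU_RECLAIM;
}
```

An unused inode carrying the flag survives one scan and is reclaimed on the next unless it is touched again in between.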
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
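
A lock-independent counter of this kind can be sketched with one slot per CPU. This is a userspace model with assumed names: `NR_CPUS_SKETCH`, the `cpu` argument and the helper functions are hypothetical, and real per-CPU variables are accessed through kernel macros rather than array indexing.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count for the illustration */

/* One slot per CPU; each CPU only ever writes its own slot. */
static long nr_inodes[NR_CPUS_SKETCH];

static void inode_counter_inc(int cpu) { nr_inodes[cpu]++; }
static void inode_counter_dec(int cpu) { nr_inodes[cpu]--; }

/*
 * Sum the slots to get the total. A single slot can go negative when an
 * inode is created on one CPU and destroyed on another, and a sum taken
 * while updates are in flight can be transiently negative, so the result
 * is clamped at zero.
 */
static long get_nr_inodes_sketch(void)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += nr_inodes[cpu];
    return sum < 0 ? 0 : sum;
}
```

Updates touch no shared cache line and need no list lock; the cost moves to the reader, which has to visit every slot.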
     
  • note: for race-free uses you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Since inode->i_mode shares its bits for S_IFMT, S_ISDIR should be
    used to distinguish whether it is a dir or not.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
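
The difference between a bit test and S_ISDIR can be shown with standard userspace headers. The helper names `is_dir_bit_test()` and `is_dir()` are made up for the illustration; `S_IFDIR`, `S_IFBLK` and `S_ISDIR` are the usual `<sys/stat.h>` definitions.

```c
#define _DEFAULT_SOURCE    /* expose S_IFDIR and S_IFBLK on glibc */
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Wrong: the file type is a multi-bit field, not a set of independent
 * flags, so other types can share bits with S_IFDIR.
 */
static bool is_dir_bit_test(mode_t mode)
{
    return (mode & S_IFDIR) != 0;
}

/* Right: S_ISDIR masks with S_IFMT and compares the whole type field. */
static bool is_dir(mode_t mode)
{
    return S_ISDIR(mode);
}
```

On Linux a block device mode (`S_IFBLK`, octal 060000) includes the `S_IFDIR` bit (octal 040000), so the bit test reports a block device as a directory while `S_ISDIR` does not.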
     
  • Hugetlbfs used to need it, but after the destroy_inode and evict_inode
    changes it's not required anymore.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig