21 Mar, 2011

1 commit


16 Mar, 2011

1 commit

  • iprune_sem is continuously giving us lockdep warnings because we take it in
    read mode in the reclaim path, but we also do non-NOFS allocations while
    holding it in write mode.

    Taking a deeper look at it, I think it's quite trivially fixable:

    - for invalidate_inodes we do not need iprune_sem at all. We have an active
    reference on the superblock, so the filesystem is not going away until it
    has finished.
    - for evict_inodes we do need it, to make sure prune_icache has done its
    work before we tear down the superblock. But there is no reason to
    hold it over the actual reclaim operation - it's enough to cycle through
    it after the actual reclaim to make sure we wait for any pending
    prune_icache to complete. We just have to remove the WARN_ON for
    otherwise busy inodes, as they can actually happen now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
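
    A minimal sketch of the resulting wait pattern in evict_inodes, assuming
    the iprune_sem rwsem from fs/inode.c (simplified, not the exact kernel
    code):

    static DECLARE_RWSEM(iprune_sem);

    void evict_inodes(struct super_block *sb)
    {
            /* ... collect the superblock's unused inodes and reclaim them ... */

            /*
             * Cycle through iprune_sem afterwards: a writer cannot get in
             * until every prune_icache holding it for read has finished,
             * so this acts as a wait-for-pending-pruners barrier without
             * holding the semaphore over the reclaim itself.
             */
            down_write(&iprune_sem);
            up_write(&iprune_sem);
    }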
     

26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     

24 Feb, 2011

2 commits

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change), so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change),
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a, which causes
    check_disk_size_change to call flush_disk, so this fix is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
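
    A rough sketch of the interface change, using the names from the commit
    text (__invalidate_device2 with a kill_dirty flag threaded through to
    invalidate_inodes); treat it as illustrative rather than the exact code:

    int __invalidate_device2(struct block_device *bdev, bool kill_dirty)
    {
            struct super_block *sb = get_super(bdev);
            int res = 0;

            if (sb) {
                    shrink_dcache_sb(sb);
                    /*
                     * kill_dirty is true from check_disk_change (device gone,
                     * dirty data can never be written back) and false from
                     * check_disk_size_change (device merely resized).
                     */
                    res = invalidate_inodes(sb, kill_dirty);
                    drop_super(sb);
            }
            invalidate_bdev(bdev);
            return res;
    }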
     
  • Michael Leun reported that running parallel opens on a fuse filesystem
    can trigger a "kernel BUG at mm/truncate.c:475"

    Gurudas Pai reported the same bug on NFS.

    The reason is, unmap_mapping_range() is not prepared for more than
    one concurrent invocation per inode. For example:

    thread1: going through a big range, stops in the middle of a vma and
    stores the restart address in vm_truncate_count.

    thread2: comes in with a small (e.g. single page) unmap request on
    the same vma, somewhere before restart_address, finds that the
    vma was already unmapped up to the restart address and happily
    returns without doing anything.

    Another scenario would be two big unmap requests, both having to
    restart the unmapping and each one setting vm_truncate_count to its
    own value. This could go on forever without any of them being able to
    finish.

    Truncate and hole punching already serialize with i_mutex. Other
    callers of unmap_mapping_range() do not, and it's difficult to get
    i_mutex protection for all callers. In particular ->d_revalidate(),
    which calls invalidate_inode_pages2_range() in fuse, may be called
    with or without i_mutex.

    This patch adds a new mutex to 'struct address_space' to prevent
    running multiple concurrent unmap_mapping_range() on the same mapping.

    [ We'll hopefully get rid of all this with the upcoming mm
    preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
    lockbreak" patch in particular. But that is for 2.6.39 ]

    Signed-off-by: Miklos Szeredi
    Reported-by: Michael Leun
    Reported-by: Gurudas Pai
    Tested-by: Gurudas Pai
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
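
    A sketch of the serialization this adds; the field name below
    (unmap_mutex) is taken as an assumption from the description, and the
    walking code is elided:

    struct address_space {
            /* ... existing fields ... */
            struct mutex    unmap_mutex;    /* serializes unmap_mapping_range() */
    };

    void unmap_mapping_range(struct address_space *mapping,
                             loff_t const holebegin, loff_t const holelen,
                             int even_cows)
    {
            mutex_lock(&mapping->unmap_mutex);
            /* ... walk the affected vmas, restarting via vm_truncate_count ... */
            mutex_unlock(&mapping->unmap_mutex);
    }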
     

07 Jan, 2011

4 commits

  • Pseudo filesystems that neither put their inodes on the RCU list nor make
    them reachable by rcu-walk dentries do not need to RCU-free their inodes.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to the
    inability to reuse cache-hot slab objects. As iterations increase and RCU
    freeing starts kicking in, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
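
    The basic shape of RCU-freeing an inode, as a hedged sketch (i_rcu as the
    rcu_head embedded in struct inode, inode_cachep as the inode slab cache):

    static void i_callback(struct rcu_head *head)
    {
            struct inode *inode = container_of(head, struct inode, i_rcu);

            kmem_cache_free(inode_cachep, inode);
    }

    static void destroy_inode(struct inode *inode)
    {
            /* ... filesystem-specific teardown ... */

            /* Defer the actual free until after an RCU grace period. */
            call_rcu(&inode->i_rcu, i_callback);
    }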
     
  • The percpu_counter library generates quite nasty code, so unless you need
    to dynamically allocate counters or take a fast approximate value, a
    simple per-cpu set of counters is much better.

    The percpu_counter can never be made to work as well, because it has an
    indirection from a pointer to percpu memory, and it can't use the direct
    this_cpu_inc interfaces because it doesn't use static PER_CPU data, so the
    generated code will always be worse.

    In the fastpath, it is the difference between this:

    incl %gs:nr_dentry # nr_dentry

    and this:

    movl percpu_counter_batch(%rip), %edx # percpu_counter_batch,
    movl $1, %esi #,
    movq $nr_dentry, %rdi #,
    call __percpu_counter_add # (plus I clobber registers)

    __percpu_counter_add:
    pushq %rbp #
    movq %rsp, %rbp #,
    subq $32, %rsp #,
    movq %rbx, -24(%rbp) #,
    movq %r12, -16(%rbp) #,
    movq %r13, -8(%rbp) #,
    movq %rdi, %rbx # fbc, fbc
    #APP
    # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
    movq %gs:kernel_stack,%rax #, pfo_ret__
    # 0 "" 2
    #NO_APP
    incl -8124(%rax) # .preempt_count
    movq 32(%rdi), %r12 # .counters, tcp_ptr__
    #APP
    # 78 "lib/percpu_counter.c" 1
    add %gs:this_cpu_off, %r12 # this_cpu_off, tcp_ptr__
    # 0 "" 2
    #NO_APP
    movslq (%r12),%r13 #* tcp_ptr__, tmp73
    movslq %edx,%rax # batch, batch
    addq %rsi, %r13 # amount, count
    cmpq %rax, %r13 # batch, count
    jge .L27 #,
    negl %edx # tmp76
    movslq %edx,%rdx # tmp76, tmp77
    cmpq %rdx, %r13 # tmp77, count
    jg .L28 #,
    .L27:
    movq %rbx, %rdi # fbc,
    call _raw_spin_lock #
    addq %r13, 8(%rbx) # count, .count
    movq %rbx, %rdi # fbc,
    movl $0, (%r12) #,* tcp_ptr__
    call _raw_spin_unlock #
    .L29:
    #APP
    # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
    movq %gs:kernel_stack,%rax #, pfo_ret__
    # 0 "" 2
    #NO_APP
    decl -8124(%rax) # .preempt_count
    movq -8136(%rax), %rax #, D.14625
    testb $8, %al #, D.14625
    jne .L32 #,
    .L31:
    movq -24(%rbp), %rbx #,
    movq -16(%rbp), %r12 #,
    movq -8(%rbp), %r13 #,
    leave
    ret
    .p2align 4,,10
    .p2align 3
    .L28:
    movl %r13d, (%r12) # count,*
    jmp .L29 #
    .L32:
    call preempt_schedule #
    .p2align 4,,6
    jmp .L31 #
    .size __percpu_counter_add, .-__percpu_counter_add
    .p2align 4,,15

    Signed-off-by: Nick Piggin

    Nick Piggin
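
    For contrast with the listing above, the simple static per-cpu counter
    being argued for looks roughly like this (helper names are illustrative);
    the write side is the single incl shown in the fast path, and the read
    side sums all CPUs:

    static DEFINE_PER_CPU(unsigned int, nr_dentry);

    static void dentry_stat_inc(void)
    {
            this_cpu_inc(nr_dentry);        /* compiles to incl %gs:nr_dentry */
    }

    static unsigned int dentry_stat_read(void)
    {
            unsigned int sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    sum += per_cpu(nr_dentry, cpu);
            return sum;
    }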
     
  • The nr_unused counters count the number of objects on an LRU, and as such they
    are synchronized with LRU object insertion and removal and scanning, and
    protected under the LRU lock.

    Making them per-cpu does not actually gain any concurrency improvement
    because of this lock; summing the counter becomes much slower, and
    incrementing/decrementing it costs more code size and is slower too.

    These counters should stay per-LRU, which currently means global.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • IMA currently allocates an inode integrity structure for every inode in
    core. This structure is about 120 bytes long. Most files, however
    (especially on a system which doesn't make use of IMA), will never need
    any of this space. The problem is that if IMA is enabled we need to
    know the number of readers and the number of writers
    for every inode on the box. At the moment we collect that information
    in the per-inode iint structure and waste the rest of the space. This
    patch moves those counters into the struct inode so we can eventually
    stop allocating an IMA integrity structure except when absolutely
    needed.

    This patch does the minimum needed to move the location of the data.
    Further cleanups, especially the location of counter updates, may still
    be possible.

    Signed-off-by: Eric Paris
    Acked-by: Mimi Zohar
    Signed-off-by: Linus Torvalds

    Eric Paris
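
    As a sketch of the data movement (field names are illustrative
    assumptions, not necessarily the final ones), the open-for-read and
    open-for-write counts live in struct inode rather than in the separately
    allocated iint:

    struct inode {
            /* ... existing fields ... */
    #ifdef CONFIG_IMA
            atomic_t        i_readcount;    /* struct files open for read */
    #endif
            atomic_t        i_writecount;   /* already tracked by the VFS */
            /* ... */
    };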
     

26 Oct, 2010

18 commits

  • Pull removal of fsnotify marks into generic_shutdown_super().
    Split umount-time work into a new function - evict_inodes().
    Make sure that invalidate_inodes() will be able to cope with
    I_FREEING once we change locking in iput().

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Despite the comment above it, we cannot safely drop the lock here.
    invalidate_list is called from many other places than just umount.
    Also switch to proper list macros now that we never drop the lock.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
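
    In other words, the single shared inode->i_list becomes two dedicated
    list heads, roughly as below (the names i_wb_list and i_lru follow what
    the series ended up with, but are best treated as illustrative):

    struct inode {
            /* ... */
            struct list_head        i_wb_list;      /* writeback (IO) list */
            struct list_head        i_lru;          /* inode LRU list */
            /* ... */
    };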
     
  • We must not call invalidate_inode_buffers in invalidate_list unless the
    inode can be reclaimed. If we remove the buffer association of a busy
    inode fsync won't find the buffers anymore. As invalidate_inode_buffers
    is called from various other sources than umount, this actually does
    matter in practice.

    While at it, change the loop to a more natural form and remove the
    WARN_ON for I_NEW, which we already tested a few lines above.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
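
    A simplified sketch of the reworked loop: the buffer association is only
    dropped once the inode is known to be disposable.

    list_for_each_entry_safe(inode, next, head, i_sb_list) {
            if (inode->i_state & I_NEW)
                    continue;
            if (atomic_read(&inode->i_count)) {
                    busy = 1;
                    continue;
            }

            /* Only a reclaimable inode may lose its buffer association. */
            invalidate_inode_buffers(inode);
            inode->i_state |= I_FREEING;
            list_move(&inode->i_list, &dispose);
    }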
     
  • Instead of always assigning an increasing inode number in new_inode,
    move the call to assign it into those callers that actually need it.
    For now the set of callers that need it is estimated conservatively; that
    is, the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
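
    In practice a caller that still wants a default inode number now does
    something like the following, using the per-cpu allocator described in
    the next commit (here called get_next_ino(), the name mainline ended up
    using):

    struct inode *inode = new_inode(sb);

    if (inode) {
            /* new_inode() no longer fills this in by default. */
            inode->i_ino = get_next_ino();
            /* ... */
    }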
     
  • new_inode() dirties a contended cache line to get increasing
    inode numbers. This limits performance on workloads that cause
    significant parallel inode allocation.

    Solve this problem by using a per_cpu variable fed by the shared
    last_ino in batches of 1024 allocations. This reduces contention on
    the shared last_ino and gives the same spread of inode numbers as before
    (i.e. the same wraparound after 2^32 allocations).

    Signed-off-by: Eric Dumazet
    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
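
    A sketch of the batching scheme: each CPU takes 1024 numbers at a time
    from the shared counter and hands them out from its local variable.

    #define LAST_INO_BATCH 1024
    static DEFINE_PER_CPU(unsigned int, last_ino);

    unsigned int get_next_ino(void)
    {
            unsigned int *p = &get_cpu_var(last_ino);
            unsigned int res = *p;

            if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
                    static atomic_t shared_last_ino;
                    int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);

                    res = next - LAST_INO_BATCH;    /* start of this CPU's batch */
            }

            *p = ++res;
            put_cpu_var(last_ino);
            return res;
    }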
     
  • Clones an existing reference to an inode; the caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
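
    The helper is essentially a guarded reference bump; a minimal sketch:

    void ihold(struct inode *inode)
    {
            /*
             * The caller must already hold a reference, so the count can
             * never legitimately be seen going from 0 to 1 here.
             */
            WARN_ON(atomic_inc_return(&inode->i_count) < 2);
    }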
     
  • Split up inode_add_to_list/__inode_add_to_list. Locking for the two
    lists will be split soon so these helpers really don't buy us much
    anymore.

    The __ prefixes for the sb list helpers will go away soon, but until
    inode_lock is gone we'll need them to distinguish between the locked
    and unlocked variants.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that iunique is not abusing find_inode anymore we can move the i_ref
    increment back to where it belongs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
    Introduce a new iunique_lock to protect the iunique counters once inode_lock
    is removed.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Before replacing the inode hash locking with a more scalable
    mechanism, factor the removal of the inode from the hashes rather
    than open coding it in several places.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
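
    A sketch of the reclaim-side check this implies (flag and list names as
    described above, otherwise illustrative): iput() no longer rotates the
    LRU; prune_icache gives recently touched inodes one more pass instead.

    /* inside prune_icache(), for each inode at the front of the LRU: */
    if (inode->i_state & I_REFERENCED) {
            /* iget touched it since it was queued: give it another pass. */
            inode->i_state &= ~I_REFERENCED;
            list_move(&inode->i_lru, &inode_lru);   /* rotate to the tail */
            continue;
    }
    /* otherwise the inode is a reclaim candidate ... */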
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • note: for race-free use you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Since inode->i_mode shares bits among the S_IFMT file types, S_ISDIR()
    should be used to distinguish whether an inode is a directory or not.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
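
    I.e. the directory check should go through the S_ISDIR() macro, which
    masks i_mode with S_IFMT before comparing, rather than testing raw bits
    (do_dir_stuff() below is just a placeholder):

    /*
     * Buggy: S_IFDIR's bits are a subset of other S_IFMT values
     * (S_IFBLK, for instance, contains them), so this also matches
     * some non-directories.
     */
    if (inode->i_mode & S_IFDIR)
            do_dir_stuff(inode);

    /* Correct: compare the whole file-type field. */
    if (S_ISDIR(inode->i_mode))
            do_dir_stuff(inode);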
     
  • Hugetlbfs used to need it, but after the destroy_inode and evict_inode
    changes it's not required anymore.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
    fanotify: use both marks when possible
    fsnotify: pass both the vfsmount mark and inode mark
    fsnotify: walk the inode and vfsmount lists simultaneously
    fsnotify: rework ignored mark flushing
    fsnotify: remove global fsnotify groups lists
    fsnotify: remove group->mask
    fsnotify: remove the global masks
    fsnotify: cleanup should_send_event
    fanotify: use the mark in handler functions
    audit: use the mark in handler functions
    dnotify: use the mark in handler functions
    inotify: use the mark in handler functions
    fsnotify: send fsnotify_mark to groups in event handling functions
    fsnotify: Exchange list heads instead of moving elements
    fsnotify: srcu to protect read side of inode and vfsmount locks
    fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
    fsnotify: use _rcu functions for mark list traversal
    fsnotify: place marks on object in order of group memory address
    vfs/fsnotify: fsnotify_close can delay the final work in fput
    fsnotify: store struct file not struct path
    ...

    Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

    Linus Torvalds
     

10 Aug, 2010

10 commits