21 Mar, 2011

1 commit


16 Mar, 2011

1 commit

  • iprune_sem is continuously giving us lockdep warnings because we take it in
    read mode in the reclaim path, but we also do non-NOFS allocations while
    holding it in write mode.

    Taking a deeper look at it, I think it's quite trivially fixable:

    - for invalidate_inodes we do not need iprune_sem at all. We have an active
    reference on the superblock, so the filesystem is not going away until it
    has finished.
    - for evict_inodes we do need it, to make sure prune_icache has done its
    work before we tear down the superblock. But there is no reason to
    hold it over the actual reclaim operation - it's enough to cycle through
    it after the actual reclaim to make sure we wait for any pending
    prune_icache to complete. We just have to remove the WARN_ON for
    otherwise busy inodes, as they can actually happen now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
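
    A minimal sketch of the resulting wait pattern in evict_inodes, assuming
    the iprune_sem rwsem from fs/inode.c (simplified, not the exact kernel
    code):

    static DECLARE_RWSEM(iprune_sem);

    void evict_inodes(struct super_block *sb)
    {
            /* ... collect the superblock's unused inodes and reclaim them ... */

            /*
             * Cycle through iprune_sem afterwards: a writer cannot get in
             * until every prune_icache holding it for read has finished,
             * so this acts as a wait-for-pending-pruners barrier without
             * holding the semaphore over the reclaim itself.
             */
            down_write(&iprune_sem);
            up_write(&iprune_sem);
    }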
     

26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     

24 Feb, 2011

2 commits

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change), so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change),
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a, which causes
    check_disk_size_change to call flush_disk, so this fix is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
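
    A rough sketch of the interface change, using the names from the commit
    text (__invalidate_device2 with a kill_dirty flag threaded through to
    invalidate_inodes); treat it as illustrative rather than the exact code:

    int __invalidate_device2(struct block_device *bdev, bool kill_dirty)
    {
            struct super_block *sb = get_super(bdev);
            int res = 0;

            if (sb) {
                    shrink_dcache_sb(sb);
                    /*
                     * kill_dirty is true from check_disk_change (device gone,
                     * dirty data can never be written back) and false from
                     * check_disk_size_change (device merely resized).
                     */
                    res = invalidate_inodes(sb, kill_dirty);
                    drop_super(sb);
            }
            invalidate_bdev(bdev);
            return res;
    }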
     
  • Michael Leun reported that running parallel opens on a fuse filesystem
    can trigger a "kernel BUG at mm/truncate.c:475"

    Gurudas Pai reported the same bug on NFS.

    The reason is, unmap_mapping_range() is not prepared for more than
    one concurrent invocation per inode. For example:

    thread1: going through a big range, stops in the middle of a vma and
    stores the restart address in vm_truncate_count.

    thread2: comes in with a small (e.g. single page) unmap request on
    the same vma, somewhere before restart_address, finds that the
    vma was already unmapped up to the restart address and happily
    returns without doing anything.

    Another scenario would be two big unmap requests, both having to
    restart the unmapping and each one setting vm_truncate_count to its
    own value. This could go on forever without any of them being able to
    finish.

    Truncate and hole punching already serialize with i_mutex. Other
    callers of unmap_mapping_range() do not, and it's difficult to get
    i_mutex protection for all callers. In particular ->d_revalidate(),
    which calls invalidate_inode_pages2_range() in fuse, may be called
    with or without i_mutex.

    This patch adds a new mutex to 'struct address_space' to prevent
    running multiple concurrent unmap_mapping_range() on the same mapping.

    [ We'll hopefully get rid of all this with the upcoming mm
    preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
    lockbreak" patch in particular. But that is for 2.6.39 ]

    Signed-off-by: Miklos Szeredi
    Reported-by: Michael Leun
    Reported-by: Gurudas Pai
    Tested-by: Gurudas Pai
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
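
    A sketch of the serialization this adds; the field name below
    (unmap_mutex) is taken as an assumption from the description, and the
    walking code is elided:

    struct address_space {
            /* ... existing fields ... */
            struct mutex    unmap_mutex;    /* serializes unmap_mapping_range() */
    };

    void unmap_mapping_range(struct address_space *mapping,
                             loff_t const holebegin, loff_t const holelen,
                             int even_cows)
    {
            mutex_lock(&mapping->unmap_mutex);
            /* ... walk the affected vmas, restarting via vm_truncate_count ... */
            mutex_unlock(&mapping->unmap_mutex);
    }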
     

07 Jan, 2011

4 commits

  • Pseudo filesystems that neither put their inodes on the RCU list nor make
    them reachable by rcu-walk dentries do not need to RCU-free their inodes.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to the
    inability to reuse cache-hot slab objects. As iterations increase and RCU
    freeing starts kicking in, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
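
    The basic shape of RCU-freeing an inode, as a hedged sketch (i_rcu as the
    rcu_head embedded in struct inode, inode_cachep as the inode slab cache):

    static void i_callback(struct rcu_head *head)
    {
            struct inode *inode = container_of(head, struct inode, i_rcu);

            kmem_cache_free(inode_cachep, inode);
    }

    static void destroy_inode(struct inode *inode)
    {
            /* ... filesystem-specific teardown ... */

            /* Defer the actual free until after an RCU grace period. */
            call_rcu(&inode->i_rcu, i_callback);
    }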
     
  • The percpu_counter library generates quite nasty code, so unless you need
    to dynamically allocate counters or take a fast approximate value, a
    simple per-cpu set of counters is much better.

    The percpu_counter can never be made to work as well, because it has an
    indirection from a pointer to percpu memory, and it can't use the direct
    this_cpu_inc interfaces because it doesn't use static PER_CPU data, so the
    generated code will always be worse.

    In the fastpath, it is the difference between this:

    incl %gs:nr_dentry # nr_dentry

    and this:

    movl percpu_counter_batch(%rip), %edx # percpu_counter_batch,
    movl $1, %esi #,
    movq $nr_dentry, %rdi #,
    call __percpu_counter_add # (plus I clobber registers)

    __percpu_counter_add:
    pushq %rbp #
    movq %rsp, %rbp #,
    subq $32, %rsp #,
    movq %rbx, -24(%rbp) #,
    movq %r12, -16(%rbp) #,
    movq %r13, -8(%rbp) #,
    movq %rdi, %rbx # fbc, fbc
    #APP
    # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
    movq %gs:kernel_stack,%rax #, pfo_ret__
    # 0 "" 2
    #NO_APP
    incl -8124(%rax) # .preempt_count
    movq 32(%rdi), %r12 # .counters, tcp_ptr__
    #APP
    # 78 "lib/percpu_counter.c" 1
    add %gs:this_cpu_off, %r12 # this_cpu_off, tcp_ptr__
    # 0 "" 2
    #NO_APP
    movslq (%r12),%r13 #* tcp_ptr__, tmp73
    movslq %edx,%rax # batch, batch
    addq %rsi, %r13 # amount, count
    cmpq %rax, %r13 # batch, count
    jge .L27 #,
    negl %edx # tmp76
    movslq %edx,%rdx # tmp76, tmp77
    cmpq %rdx, %r13 # tmp77, count
    jg .L28 #,
    .L27:
    movq %rbx, %rdi # fbc,
    call _raw_spin_lock #
    addq %r13, 8(%rbx) # count, .count
    movq %rbx, %rdi # fbc,
    movl $0, (%r12) #,* tcp_ptr__
    call _raw_spin_unlock #
    .L29:
    #APP
    # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
    movq %gs:kernel_stack,%rax #, pfo_ret__
    # 0 "" 2
    #NO_APP
    decl -8124(%rax) # .preempt_count
    movq -8136(%rax), %rax #, D.14625
    testb $8, %al #, D.14625
    jne .L32 #,
    .L31:
    movq -24(%rbp), %rbx #,
    movq -16(%rbp), %r12 #,
    movq -8(%rbp), %r13 #,
    leave
    ret
    .p2align 4,,10
    .p2align 3
    .L28:
    movl %r13d, (%r12) # count,*
    jmp .L29 #
    .L32:
    call preempt_schedule #
    .p2align 4,,6
    jmp .L31 #
    .size __percpu_counter_add, .-__percpu_counter_add
    .p2align 4,,15

    Signed-off-by: Nick Piggin

    Nick Piggin
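
    For contrast with the listing above, the simple static per-cpu counter
    being argued for looks roughly like this (helper names are illustrative);
    the write side is the single incl shown in the fast path, and the read
    side sums all CPUs:

    static DEFINE_PER_CPU(unsigned int, nr_dentry);

    static void dentry_stat_inc(void)
    {
            this_cpu_inc(nr_dentry);        /* compiles to incl %gs:nr_dentry */
    }

    static unsigned int dentry_stat_read(void)
    {
            unsigned int sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    sum += per_cpu(nr_dentry, cpu);
            return sum;
    }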
     
  • The nr_unused counters count the number of objects on an LRU, and as such they
    are synchronized with LRU object insertion and removal and scanning, and
    protected under the LRU lock.

    Making them per-cpu does not actually gain any concurrency improvement
    because of this lock; summing the counter becomes much slower, and
    incrementing/decrementing it costs more code size and is slower too.

    These counters should stay per-LRU, which currently means global.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • IMA currently allocates an inode integrity structure for every inode in
    core. This structure is about 120 bytes long. Most files, however
    (especially on a system which doesn't make use of IMA), will never need
    any of this space. The problem is that if IMA is enabled we need to
    know the number of readers and the number of writers
    for every inode on the box. At the moment we collect that information
    in the per-inode iint structure and waste the rest of the space. This
    patch moves those counters into the struct inode so we can eventually
    stop allocating an IMA integrity structure except when absolutely
    needed.

    This patch does the minimum needed to move the location of the data.
    Further cleanups, especially the location of counter updates, may still
    be possible.

    Signed-off-by: Eric Paris
    Acked-by: Mimi Zohar
    Signed-off-by: Linus Torvalds

    Eric Paris
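
    As a sketch of the data movement (field names are illustrative
    assumptions, not necessarily the final ones), the open-for-read and
    open-for-write counts live in struct inode rather than in the separately
    allocated iint:

    struct inode {
            /* ... existing fields ... */
    #ifdef CONFIG_IMA
            atomic_t        i_readcount;    /* struct files open for read */
    #endif
            atomic_t        i_writecount;   /* already tracked by the VFS */
            /* ... */
    };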
     

26 Oct, 2010

18 commits

  • Pull removal of fsnotify marks into generic_shutdown_super().
    Split umount-time work into a new function - evict_inodes().
    Make sure that invalidate_inodes() will be able to cope with
    I_FREEING once we change locking in iput().

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Despite the comment above it, we cannot safely drop the lock here.
    invalidate_list is called from many other places than just umount.
    Also switch to proper list macros now that we never drop the lock.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
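
    In other words, the single shared inode->i_list becomes two dedicated
    list heads, roughly as below (the names i_wb_list and i_lru follow what
    the series ended up with, but are best treated as illustrative):

    struct inode {
            /* ... */
            struct list_head        i_wb_list;      /* writeback (IO) list */
            struct list_head        i_lru;          /* inode LRU list */
            /* ... */
    };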
     
  • We must not call invalidate_inode_buffers in invalidate_list unless the
    inode can be reclaimed. If we remove the buffer association of a busy
    inode fsync won't find the buffers anymore. As invalidate_inode_buffers
    is called from various other sources than umount, this actually does
    matter in practice.

    While at it, change the loop to a more natural form and remove the
    WARN_ON for I_NEW, which we already tested a few lines above.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
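
    A simplified sketch of the reworked loop: the buffer association is only
    dropped once the inode is known to be disposable.

    list_for_each_entry_safe(inode, next, head, i_sb_list) {
            if (inode->i_state & I_NEW)
                    continue;
            if (atomic_read(&inode->i_count)) {
                    busy = 1;
                    continue;
            }

            /* Only a reclaimable inode may lose its buffer association. */
            invalidate_inode_buffers(inode);
            inode->i_state |= I_FREEING;
            list_move(&inode->i_list, &dispose);
    }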
     
  • Instead of always assigning an increasing inode number in new_inode,
    move the call to assign it into those callers that actually need it.
    For now the set of callers that need it is estimated conservatively; that
    is, the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
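
    In practice a caller that still wants a default inode number now does
    something like the following, using the per-cpu allocator described in
    the next commit (here called get_next_ino(), the name mainline ended up
    using):

    struct inode *inode = new_inode(sb);

    if (inode) {
            /* new_inode() no longer fills this in by default. */
            inode->i_ino = get_next_ino();
            /* ... */
    }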
     
  • new_inode() dirties a contended cache line to get increasing
    inode numbers. This limits performance on workloads that cause
    significant parallel inode allocation.

    Solve this problem by using a per_cpu variable fed by the shared
    last_ino in batches of 1024 allocations. This reduces contention on
    the shared last_ino and gives the same spread of inode numbers as before
    (i.e. the same wraparound after 2^32 allocations).

    Signed-off-by: Eric Dumazet
    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
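
    A sketch of the batching scheme: each CPU takes 1024 numbers at a time
    from the shared counter and hands them out from its local variable.

    #define LAST_INO_BATCH 1024
    static DEFINE_PER_CPU(unsigned int, last_ino);

    unsigned int get_next_ino(void)
    {
            unsigned int *p = &get_cpu_var(last_ino);
            unsigned int res = *p;

            if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
                    static atomic_t shared_last_ino;
                    int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);

                    res = next - LAST_INO_BATCH;    /* start of this CPU's batch */
            }

            *p = ++res;
            put_cpu_var(last_ino);
            return res;
    }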
     
  • Clones an existing reference to an inode; the caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
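
    The helper is essentially a guarded reference bump; a minimal sketch:

    void ihold(struct inode *inode)
    {
            /*
             * The caller must already hold a reference, so the count can
             * never legitimately be seen going from 0 to 1 here.
             */
            WARN_ON(atomic_inc_return(&inode->i_count) < 2);
    }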
     
  • Split up inode_add_to_list/__inode_add_to_list. Locking for the two
    lists will be split soon so these helpers really don't buy us much
    anymore.

    The __ prefixes for the sb list helpers will go away soon, but until
    inode_lock is gone we'll need them to distinguish between the locked
    and unlocked variants.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that iunique is not abusing find_inode anymore we can move the i_ref
    increment back to where it belongs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
    Introduce a new iunique_lock to protect the iunique counters once inode_lock
    is removed.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Before replacing the inode hash locking with a more scalable
    mechanism, factor the removal of the inode from the hashes rather
    than open coding it in several places.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
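
    A sketch of the reclaim-side check this implies (flag and list names as
    described above, otherwise illustrative): iput() no longer rotates the
    LRU; prune_icache gives recently touched inodes one more pass instead.

    /* inside prune_icache(), for each inode at the front of the LRU: */
    if (inode->i_state & I_REFERENCED) {
            /* iget touched it since it was queued: give it another pass. */
            inode->i_state &= ~I_REFERENCED;
            list_move(&inode->i_lru, &inode_lru);   /* rotate to the tail */
            continue;
    }
    /* otherwise the inode is a reclaim candidate ... */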
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • note: for race-free use you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Since inode->i_mode shares bits among the S_IFMT file types, S_ISDIR()
    should be used to distinguish whether an inode is a directory or not.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
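
    I.e. the directory check should go through the S_ISDIR() macro, which
    masks i_mode with S_IFMT before comparing, rather than testing raw bits
    (do_dir_stuff() below is just a placeholder):

    /*
     * Buggy: S_IFDIR's bits are a subset of other S_IFMT values
     * (S_IFBLK, for instance, contains them), so this also matches
     * some non-directories.
     */
    if (inode->i_mode & S_IFDIR)
            do_dir_stuff(inode);

    /* Correct: compare the whole file-type field. */
    if (S_ISDIR(inode->i_mode))
            do_dir_stuff(inode);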
     
  • Hugetlbfs used to need it, but after the destroy_inode and evict_inode
    changes it's not required anymore.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
    fanotify: use both marks when possible
    fsnotify: pass both the vfsmount mark and inode mark
    fsnotify: walk the inode and vfsmount lists simultaneously
    fsnotify: rework ignored mark flushing
    fsnotify: remove global fsnotify groups lists
    fsnotify: remove group->mask
    fsnotify: remove the global masks
    fsnotify: cleanup should_send_event
    fanotify: use the mark in handler functions
    audit: use the mark in handler functions
    dnotify: use the mark in handler functions
    inotify: use the mark in handler functions
    fsnotify: send fsnotify_mark to groups in event handling functions
    fsnotify: Exchange list heads instead of moving elements
    fsnotify: srcu to protect read side of inode and vfsmount locks
    fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
    fsnotify: use _rcu functions for mark list traversal
    fsnotify: place marks on object in order of group memory address
    vfs/fsnotify: fsnotify_close can delay the final work in fput
    fsnotify: store struct file not struct path
    ...

    Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

    Linus Torvalds
     

10 Aug, 2010

10 commits