Doug / smarc-fsl-linux-kernel | Embedian Git Server

29 May, 2012

1 commit

90324cc1b Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally

Linus Torvalds
2012-05-29 00:54:45 +0800

25 May, 2012

1 commit

28f3d7176 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull more networking updates from David Miller:
"Ok, everything from here on out will be bug fixes."

1) One final sync of wireless and bluetooth stuff from John Linville.
These changes have all been in his tree for more than a week, and
therefore have had the necessary -next exposure. John was just away
on a trip and didn't have a chance to send the pull request until a
day or two ago.

2) Put back some defines in user exposed header file areas that were
removed during the tokenring purge. From Stephen Hemminger and Paul
Gortmaker.

3) A bug fix for UDP hash table allocation got lost in the pile due to
one of those "you got it.. no I've got it.." situations. :-)

From Tim Bird.

4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

5) RCU routing table lookups can race with free_fib_info(), causing
crashes when we deref the device pointers in the route. Fix by
releasing the net device in the RCU callback. From Yanmin Zhang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
tcp: take care of overlaps in tcp_try_coalesce()
ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
mm: add a low limit to alloc_large_system_hash
ipx: restore token ring define to include/linux/ipx.h
if: restore token ring ARP type to header
xen: do not disable netfront in dom0
phy/micrel: Fix ID of KSZ9021
mISDN: Add X-Tensions USB ISDN TA XC-525
gianfar:don't add FCB length to hard_header_len
Bluetooth: Report proper error number in disconnection
Bluetooth: Create flags for bt_sk()
Bluetooth: report the right security level in getsockopt
Bluetooth: Lock the L2CAP channel when sending
Bluetooth: Restore locking semantics when looking up L2CAP channels
Bluetooth: Fix a redundant and problematic incoming MTU check
Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
Bluetooth: Fix EIR data generation for mgmt_device_found
Bluetooth: Fix Inquiry with RSSI event mask
Bluetooth: improve readability of l2cap_seq_list code
Bluetooth: Fix skb length calculation
...

Linus Torvalds
2012-05-25 02:54:29 +0800

24 May, 2012

1 commit

31fe62b95 mm: add a low limit to alloc_large_system_hash

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

In some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocate a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.
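
The effect of a low limit can be sketched in user-space C. This is a simplified model, not the kernel function: pick_hash_entries() and its parameters are hypothetical names, and the sketch ignores the memory-based scaling, NUMA placement and allocation that the real alloc_large_system_hash() performs.

```c
#include <assert.h>
#include <limits.h>

/* Round v up to the next power of two; stops early instead of overflowing. */
static unsigned long roundup_pow2(unsigned long v)
{
    unsigned long p = 1;

    while (p < v && p <= ULONG_MAX / 2)
        p <<= 1;
    return p;
}

/*
 * Hypothetical model: pick a table size from a memory-derived suggestion,
 * then clamp it. high_limit == 0 means "no upper limit". The low limit is
 * applied last so the caller never gets a table smaller than it needs.
 */
static unsigned long pick_hash_entries(unsigned long suggested,
                                       unsigned long low_limit,
                                       unsigned long high_limit)
{
    unsigned long n = roundup_pow2(suggested ? suggested : 1);

    if (high_limit && n > high_limit)
        n = high_limit;
    if (n < low_limit)
        n = low_limit;
    return n;
}
```

With a low limit passed in, a caller such as the UDP table setup would no longer need to discard a too-small table and allocate a second one, which is where the leak came from.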

Reported-by: Mark Asselstine
Reported-by: Tim Bird
Signed-off-by: Eric Dumazet
Cc: Paul Gortmaker
Signed-off-by: David S. Miller

Tim Bird
2012-05-24 12:28:21 +0800

06 May, 2012

3 commits

169ebd901 writeback: Avoid iput() from flusher thread

Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
because iput() can do a lot of work - for example truncate the inode if it's
the last iput on unlinked file. Some filesystems depend on flusher thread
progressing (e.g. because they need to flush delay allocated blocks to reduce
allocation uncertainty) and so flusher thread doing truncate creates
interesting dependencies and possibilities for deadlocks.

We get rid of iput() in flusher thread by using the fact that I_SYNC inode
flag effectively pins the inode in memory. So if we take care to either hold
i_lock or have I_SYNC set, we can get away without taking inode reference
in writeback_sb_inodes().

As a side effect of these changes, we also fix possible use-after-free in
wb_writeback() because inode_wait_for_writeback() call could try to reacquire
i_lock on the inode that was already freed.

Signed-off-by: Jan Kara
Signed-off-by: Fengguang Wu

Jan Kara
2012-05-06 13:43:41 +0800
dbd5768f8 vfs: Rename end_writeback() to clear_inode()

After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which describes well what the function really does: set the I_CLEAR flag.

Signed-off-by: Jan Kara
Signed-off-by: Fengguang Wu

Jan Kara
2012-05-06 13:43:41 +0800
7994e6f72 vfs: Move waiting for inode writeback from end_writeback() to evict_inode()

Currently, I_SYNC can never be set when evict_inode() (and thus
end_writeback()) is called because flusher thread holds inode reference while
inode is under writeback. As a result inode_sync_wait() in those places
currently does nothing. However that is going to change and unveils problems
with calling inode_sync_wait() from end_writeback(). Several filesystems call
end_writeback() after they have deleted the inode (btrfs, gfs2, ...) and other
filesystems (ext3, ext4, reiserfs, ...) can deadlock when waiting for I_SYNC
because they call end_writeback() from within a transaction.

To avoid these issues, we move inode_sync_wait() into evict_inode() before
calling ->evict_inode(). That way we preserve the current property that
->evict_inode() and writeback never run in parallel and all filesystems are
safe.

Signed-off-by: Jan Kara
Signed-off-by: Fengguang Wu

Jan Kara
2012-05-06 13:43:40 +0800

03 May, 2012

1 commit

92361636e userns: Store uid and gid types in vfs structures with kuid_t and kgid_t types

The conversion of all of the users is not done yet; there are too many to change
in one go and leave the code reviewable. For now I change just the header and
a few trivial users and rely on CONFIG_UIDGID_STRICT_TYPE_CHECKS not being set
to ensure that the code will still compile during the transition.

Helper functions i_uid_read, i_uid_write, i_gid_read, i_gid_write are added
so that in most cases filesystems can avoid the complexities of multiple user
namespaces and can concentrate on moving their raw numeric values into and
out of the vfs data structures.
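
A minimal user-space sketch of the idea follows. It assumes a single user namespace, so the translation between kuid_t/kgid_t and raw numbers is treated as the identity; struct inode_model is a hypothetical stand-in for struct inode, and the helper bodies are illustrative rather than the kernel's.

```c
#include <assert.h>

/*
 * Wrapping the raw number in a struct makes kuid_t and kgid_t distinct
 * types, so passing one where the other is expected fails to compile.
 */
typedef struct { unsigned int val; } kuid_t;
typedef struct { unsigned int val; } kgid_t;

/* Hypothetical stand-in for the uid/gid fields of struct inode. */
struct inode_model {
    kuid_t i_uid;
    kgid_t i_gid;
};

/* Helpers let a filesystem move raw numeric ids in and out of the inode. */
static unsigned int i_uid_read(const struct inode_model *inode)
{
    return inode->i_uid.val;    /* assumption: identity mapping */
}

static unsigned int i_gid_read(const struct inode_model *inode)
{
    return inode->i_gid.val;
}

static void i_uid_write(struct inode_model *inode, unsigned int uid)
{
    inode->i_uid.val = uid;
}

static void i_gid_write(struct inode_model *inode, unsigned int gid)
{
    inode->i_gid.val = gid;
}
```

A filesystem reading an on-disk inode would call i_uid_write() and i_gid_write() with the stored numbers and never touch the typed fields directly.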

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-05-03 18:29:32 +0800

08 Apr, 2012

1 commit

1a48e2ac0 userns: Replace the hard to write inode_userns with inode_capable.

This represents a change in strategy of how to handle user namespaces.
Instead of tagging everything explicitly with a user namespace and bulking
up all of the comparisons of uids and gids in the kernel, all uids and gids
in use will have a mapping to flat kuid and kgid spaces respectively. This
allows much more of the existing logic to be preserved and in general
allows for faster code.

In this new and improved world we allow someone to utilize capabilities
over an inode if the inode's owner maps into the capabilities holder's user
namespace and the user has capabilities in their user namespace. Which
is simple and efficient.

Moving the fs uid comparisons to be comparisons in a flat kuid space
follows in later patches, something that is only significant if you
are using user namespaces.

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-04-08 08:02:46 +0800

21 Mar, 2012

3 commits

e59cc473c trim includes in inode.c

Signed-off-by: Al Viro

Al Viro
2012-03-21 09:29:51 +0800
68ac1234f switch touch_atime to struct path

Signed-off-by: Al Viro

Al Viro
2012-03-21 09:29:41 +0800
9bcb4b733 vfs: turn generic_drop_inode() into static inline

Once upon a time it used to be much bigger, but these days there's
no point whatsoever keeping it in fs/inode.c, especially since
it's not even needed as initializer for ->drop_inode() - it's the
default and leaving ->drop_inode NULL will do just as well.

Signed-off-by: Al Viro

Al Viro
2012-03-21 09:29:33 +0800

11 Mar, 2012

2 commits

310fa7a36 restore smp_mb() in unlock_new_inode()

wait_on_inode() doesn't have ->i_lock

Signed-off-by: Al Viro

Al Viro
2012-03-11 06:07:28 +0800
978d6d8c4 vfs: Correctly set the dir i_mutex lockdep class

9a7aa12f3911853a introduced additional logic around setting the i_mutex
lockdep class for directory inodes. The idea was that some filesystems
may want their own special lockdep class for different directory
inodes and calling unlock_new_inode() should not clobber one of
those special classes.

I believe that the added conditional, around the *negated* return value
of lockdep_match_class(), caused directory inodes to be placed in the
wrong lockdep class.

inode_init_always() sets the i_mutex lockdep class with i_mutex_key for
all inodes. If the filesystem did not change the class during inode
initialization, then the conditional mentioned above was false and the
directory inode was incorrectly left in the non-directory lockdep class.
If the filesystem did set a special lockdep class, then the conditional
mentioned above was true and that class was clobbered with
i_mutex_dir_key.

This patch removes the negation from the conditional so that the i_mutex
lockdep class is properly set for directory inodes. Special classes are
preserved and directory inodes with unmodified classes are set with
i_mutex_dir_key.
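
The difference between the negated and the corrected conditional can be modeled in a few lines of user-space C. The enum, the struct and both function names are hypothetical; the real code compares lockdep classes with lockdep_match_class() rather than an enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the lockdep classes discussed above. */
enum lock_class { I_MUTEX_KEY, I_MUTEX_DIR_KEY, FS_SPECIAL_KEY };

struct inode_model {
    bool is_dir;
    enum lock_class i_mutex_class;
};

/*
 * Negated test: a directory still in the default class is left alone,
 * and a filesystem-specific class is overwritten.
 */
static void set_dir_class_negated(struct inode_model *inode)
{
    if (inode->is_dir && !(inode->i_mutex_class == I_MUTEX_KEY))
        inode->i_mutex_class = I_MUTEX_DIR_KEY;
}

/*
 * Corrected test: only directories that still carry the default class
 * are moved to the directory class; special classes are preserved.
 */
static void set_dir_class_fixed(struct inode_model *inode)
{
    if (inode->is_dir && inode->i_mutex_class == I_MUTEX_KEY)
        inode->i_mutex_class = I_MUTEX_DIR_KEY;
}
```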

Signed-off-by: Tyler Hicks
Reviewed-by: Jan Kara
Signed-off-by: Al Viro

Tyler Hicks
2012-03-11 05:05:38 +0800

14 Feb, 2012

1 commit

074b85175 vfs: fix panic in __d_lookup() with high dentry hashtable counts

When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.
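
Why a signed loop bound skips the initialization can be shown without running two billion iterations. The sketch below only evaluates the loop's first comparison; both function names are hypothetical, shift must be below 32, and the conversion of 2^31 to int is implementation-defined (on common two's-complement ABIs it yields a negative value).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Models "for (int loop = 0; loop < bound; loop++)" where bound is the
 * table size stored in a signed int. Returns whether the body would run
 * at least once.
 */
static bool init_loop_runs_signed(unsigned int shift)
{
    /* Implementation-defined for shift == 31; typically INT_MIN. */
    int bound = (int)(1U << shift);

    return 0 < bound;
}

/* Same loop with an unsigned counter and bound. */
static bool init_loop_runs_unsigned(unsigned int shift)
{
    unsigned int bound = 1U << shift;

    return 0U < bound;
}
```

With 2147483648 entries the shift is 31, so the signed comparison fails immediately and no bucket is ever initialized, while the unsigned variant iterates over the whole table.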

Signed-off-by: Dimitri Sivanich
Acked-by: David S. Miller
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Dimitri Sivanich
2012-02-14 09:45:38 +0800

18 Jan, 2012

1 commit

424a5334a vfs: remove printk from set_nlink()

Don't log a message for set_nlink(0).

Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro

Miklos Szeredi
2012-01-18 05:39:47 +0800

11 Jan, 2012

1 commit

5f8aefd44 mm: account reaped page cache on inode cache pruning

Inode cache pruning indirectly reclaims page-cache by invalidating mapping
pages. Let's account them into reclaim-state to notice this progress in
memory reclaimer.

Signed-off-by: Konstantin Khlebnikov
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-01-11 08:30:42 +0800

07 Jan, 2012

1 commit

7ada4db88 vfs: count unlinked inodes

Add a new counter to the superblock that keeps track of unlinked but
not yet deleted inodes.

Do not WARN_ON if set_nlink is called with zero count, just do a
ratelimited printk. This happens on xfs and probably other
filesystems after an unclean shutdown when the filesystem reads inodes
which already have zero i_nlink. Reported by Christoph Hellwig.

Signed-off-by: Miklos Szeredi
Tested-by: Toshiyuki Okajima
Signed-off-by: Al Viro

Miklos Szeredi
2012-01-07 12:20:12 +0800

04 Jan, 2012

3 commits

62bb10917 switch inode_init_owner() to umode_t

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:55:11 +0800
6b520e056 vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:40 +0800
2a79f17e4 vfs: mnt_drop_write_file()

new helper (wrapper around mnt_drop_write()) to be used in pair with
mnt_want_write_file().

Signed-off-by: Al Viro

Al Viro
2012-01-04 11:52:40 +0800

02 Nov, 2011

1 commit

a78ef704a vfs: protect i_nlink

Prevent direct modification of i_nlink by making it const and adding a
non-const __i_nlink alias.
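
The const/non-const alias can be sketched with an anonymous union (C11). struct inode_model is a hypothetical stand-in for struct inode, and the helper bodies are illustrative. Writing through the non-const member of a union that also has a const member relies on compiler behaviour that the technique itself depends on; treat this as a sketch rather than strictly portable C.

```c
#include <assert.h>

/* Hypothetical stand-in for the link-count field of struct inode. */
struct inode_model {
    union {
        const unsigned int i_nlink;   /* read-only view for ordinary code */
        unsigned int __i_nlink;       /* writable alias used by helpers   */
    };
};

/* All modifications go through helpers, so they can be audited or counted. */
static void set_nlink(struct inode_model *inode, unsigned int nlink)
{
    inode->__i_nlink = nlink;
}

static void inc_nlink(struct inode_model *inode)
{
    inode->__i_nlink++;
}

static void drop_nlink(struct inode_model *inode)
{
    inode->__i_nlink--;
}
```

A direct assignment such as `inode->i_nlink = 0;` now fails to compile, which forces callers onto the helpers.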

Signed-off-by: Miklos Szeredi
Tested-by: Toshiyuki Okajima
Signed-off-by: Christoph Hellwig

Miklos Szeredi
2011-11-02 19:53:43 +0800

28 Oct, 2011

1 commit

62a3ddef6 vfs: fix spinning prevention in prune_icache_sb

We need to move the inode to the end of the list to actually make the
spinning prevention explained in the comment above it work. With a
plain list_move it will simply stay in place as we're always reclaiming
from the head of the list.
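
The list behaviour described here can be reproduced with a tiny user-space reimplementation of the circular doubly linked list. The helpers mirror the names of the kernel list API but are written from scratch for this sketch, and the scenario follows the description in the message above rather than the kernel source.

```c
#include <assert.h>

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

/* Link e between two known neighbours. */
static void list_insert(struct list_head *e, struct list_head *prev,
                        struct list_head *next)
{
    next->prev = e;
    e->next = next;
    e->prev = prev;
    prev->next = e;
}

static void list_unlink(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
    list_insert(e, head->prev, head);
}

/* Move e to the front: a no-op in effect if e is already first. */
static void list_move(struct list_head *e, struct list_head *head)
{
    list_unlink(e);
    list_insert(e, head, head->next);
}

/* Move e to the back, so a scan from the head sees other entries first. */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
    list_unlink(e);
    list_insert(e, head->prev, head);
}
```

If the scan always takes the first entry, list_move() on that entry leaves it first and the scan spins on it; list_move_tail() lets the scan make progress.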

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2011-10-28 20:58:55 +0800

26 Aug, 2011

1 commit

e096d0c7e lockdep: Add helper function for dir vs file i_mutex annotation

Purely in-memory filesystems do not use the inode hash as the dcache
tells us if an entry already exists. As a result, they do not call
unlock_new_inode, and thus directory inodes do not get put into a
different lockdep class for i_sem.

We need the different lockdep classes, because the locking order for
i_mutex is different for directory inodes and regular inodes. Directory
inodes can do "readdir()", which takes i_mutex *before* possibly taking
mm->mmap_sem (due to a page fault while copying the directory entry to
user space).

In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
before accessing i_mutex.

The two cases can never happen for the same inode, so no real deadlock
can occur, but without the different lockdep classes, lockdep cannot
understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
can lead to false positives from lockdep like below:

find/645 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

but task is already holding lock:
(&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
vfs_readdir+0x5b/0xb4

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
[] lock_acquire+0xbf/0x103
[] __mutex_lock_common+0x4c/0x361
[] mutex_lock_nested+0x40/0x45
[] hugetlbfs_file_mmap+0x82/0x110
[] mmap_region+0x258/0x432
[] do_mmap_pgoff+0x2ac/0x306
[] sys_mmap_pgoff+0x118/0x16a
[] sys_mmap+0x22/0x24
[] system_call_fastpath+0x16/0x1b

-> #0 (&mm->mmap_sem){++++++}:
[] __lock_acquire+0xa1a/0xcf7
[] lock_acquire+0xbf/0x103
[] might_fault+0x89/0xac
[] filldir+0x6f/0xc7
[] dcache_readdir+0x67/0x205
[] vfs_readdir+0x7b/0xb4
[] sys_getdents+0x7e/0xd1
[] system_call_fastpath+0x16/0x1b

This patch moves the directory vs file lockdep annotation into a helper
function that can be called by in-memory filesystems and has hugetlbfs
call it.

Signed-off-by: Josh Boyer
Acked-by: Peter Zijlstra
Signed-off-by: Linus Torvalds

Josh Boyer
2011-08-26 01:50:18 +0800

07 Aug, 2011

1 commit

3ddcd0569 vfs: optimize inode cache access patterns

The inode structure layout is largely random, and some of the vfs paths
really do care. The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.

We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.

It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.

The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible. The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture. So there's more tuning
to be done.

Signed-off-by: Linus Torvalds

Linus Torvalds
2011-08-07 13:53:23 +0800

01 Aug, 2011

3 commits

c4ae0c654 vfs: avoid call to inode_lru_list_del() if possible

inode_lru_list_del() is expensive because of per superblock lru locking,
while some inodes are not in lru list.

Adding a check in iput_final() can speed up pipe/sockets workloads on
SMP.

Signed-off-by: Eric Dumazet
Signed-off-by: Al Viro

Eric Dumazet
2011-08-01 13:41:17 +0800
f2ee7abf4 vfs: avoid taking inode_hash_lock on pipes and sockets

Some inodes (pipes, sockets, ...) are not hashed, no need to take
contended inode_hash_lock at dismantle time.

Nice speedup on SMP machines on socket intensive workloads.

Signed-off-by: Eric Dumazet
Signed-off-by: Al Viro

Eric Dumazet
2011-08-01 13:41:17 +0800
b12362bdb vfs: conditionally call inode_wb_list_del()

Some inodes (pipes, sockets, ...) are not in bdi writeback list.

evict() can avoid calling inode_wb_list_del() and its expensive spinlock
by checking inode i_wb_list being empty or not.

At this point, no other cpu/user can concurrently manipulate this inode
i_wb_list
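
The pattern of checking list membership before taking a contended lock can be sketched as follows. The lock is replaced by a counter so the saving is observable; evict_model(), wb_list_del_locked() and lock_acquisitions are hypothetical names, and the sketch assumes, as stated above, that nobody else can touch the list entry at this point.

```c
#include <assert.h>
#include <stdbool.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An entry that points to itself is not on any list. */
static bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Stand-in for a contended spinlock: counts how often it is taken. */
static unsigned long lock_acquisitions;

static void wb_list_del_locked(struct list_head *e)
{
    lock_acquisitions++;            /* spin_lock() in real code   */
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = e;          /* leave the entry self-linked */
                                    /* spin_unlock() in real code */
}

/* Only pay for the lock when the inode is actually on the list. */
static void evict_model(struct list_head *i_wb_list)
{
    if (!list_empty(i_wb_list))
        wb_list_del_locked(i_wb_list);
}
```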

Signed-off-by: Eric Dumazet
Signed-off-by: Al Viro

Eric Dumazet
2011-08-01 13:41:17 +0800

27 Jul, 2011

3 commits

e371d46ae Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
merge fchmod() and fchmodat() guts, kill ancient broken kludge
xfs: fix misspelled S_IS...()
xfs: get rid of open-coded S_ISREG(), etc.
vfs: document locking requirements for d_move, __d_move and d_materialise_unique
omfs: fix (mode & S_IFDIR) abuse
btrfs: S_ISREG(mode) is not mode & S_IFREG...
ima: fmode_t misspelled as mode_t...
pci-label.c: size_t misspelled as mode_t
jffs2: S_ISLNK(mode & S_IFMT) is pointless
snd_msnd ->mode is fmode_t, not mode_t
v9fs_iop_get_acl: get rid of unused variable
vfs: dont chain pipe/anon/socket on superblock s_inodes list
Documentation: Exporting: update description of d_splice_alias
fs: add missing unlock in default_llseek()

Linus Torvalds
2011-07-27 09:30:20 +0800
f01ef569c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
mm: properly reflect task dirty limits in dirty_exceeded logic
writeback: don't busy retry writeback on new/freeing inodes
writeback: scale IO chunk size up to half device bandwidth
writeback: trace global_dirty_state
writeback: introduce max-pause and pass-good dirty limits
writeback: introduce smoothed global dirty limit
writeback: consolidate variable names in balance_dirty_pages()
writeback: show bdi write bandwidth in debugfs
writeback: bdi write bandwidth estimation
writeback: account per-bdi accumulated written pages
writeback: make writeback_control.nr_to_write straight
writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
writeback: trace event writeback_queue_io
writeback: trace event writeback_single_inode
writeback: remove .nonblocking and .encountered_congestion
writeback: remove writeback_control.more_io
writeback: skip balance_dirty_pages() for in-memory fs
writeback: add bdi_dirty_limit() kernel-doc
writeback: avoid extra sync work at enqueue time
writeback: elevate queue_io() into wb_writeback()
...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

Linus Torvalds
2011-07-27 01:39:54 +0800
a209dfc7b vfs: dont chain pipe/anon/socket on superblock s_inodes list

Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet
Reviewed-by: Christoph Hellwig
Signed-off-by: Al Viro

Eric Dumazet
2011-07-27 00:57:09 +0800

21 Jul, 2011

3 commits

bd5fe6c5e fs: kill i_alloc_sem

i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and its write side is always mirrored by
real exclusion. Its intended use is to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall away
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to read zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).
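
The counting scheme can be sketched with C11 atomics. This is a user-space model under stated assumptions: struct inode_model, dio_begin(), dio_end() and truncate_may_proceed() are hypothetical names, the wake-up is reduced to a boolean return value, and the sketch assumes the waiter is woken when the count drops to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new counter in struct inode. */
struct inode_model {
    atomic_int i_dio_count;     /* number of pending direct I/O requests */
};

/* Called when a direct I/O request is started. */
static void dio_begin(struct inode_model *inode)
{
    atomic_fetch_add(&inode->i_dio_count, 1);
}

/*
 * Called when a direct I/O request completes. Returns true if this was
 * the last pending request, i.e. a waiting truncate should be woken.
 */
static bool dio_end(struct inode_model *inode)
{
    return atomic_fetch_sub(&inode->i_dio_count, 1) == 1;
}

/* Truncate may only go ahead once no direct I/O is pending. */
static bool truncate_may_proceed(struct inode_model *inode)
{
    return atomic_load(&inode->i_dio_count) == 0;
}
```

Because new requests can only start under the same lock that truncate holds, the count cannot rise again while truncate waits for it to reach zero.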

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2011-07-21 08:47:46 +0800
4f8c19fdf inode: remove iprune_sem

Now that we have per-sb shrinkers with a lifecycle that is a subset
of the superblock lifecycle and can reliably detect a filesystem
being unmounted, there is no longer any race condition for the
iprune_sem to protect against. Hence we can remove it.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-07-21 08:47:40 +0800
b0d40c92a superblock: introduce per-sb cache shrinker infrastructure

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code for the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-07-21 08:47:10 +0800

20 Jul, 2011

4 commits

09cc9fc7a inode: move to per-sb LRU locks

With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-07-20 13:44:36 +0800
98b745c64 inode: Make unused inode LRU per superblock

The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.

To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.

The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-07-20 13:44:35 +0800
fcb94f72d inode: convert inode_stat.nr_unused to per-cpu counters

Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
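
A per-cpu counter trades an exact, lock-protected value for cheap local updates and a summed read. The sketch below models this in user-space C with a fixed array; NR_CPUS_MODEL, nr_unused_add() and get_nr_unused() are hypothetical names, and real per-cpu variables additionally guarantee that each CPU only touches its own slot.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4     /* assumption: small fixed CPU count */

/* One slot per CPU; a slot may go negative on its own. */
static long nr_unused[NR_CPUS_MODEL];

/* Update only the local CPU's slot: no shared lock is needed. */
static void nr_unused_add(unsigned int cpu, long delta)
{
    nr_unused[cpu % NR_CPUS_MODEL] += delta;
}

/*
 * Reading sums all slots. The result is approximate while updates are
 * in flight, so a transiently negative sum is reported as zero.
 */
static long get_nr_unused(void)
{
    long sum = 0;

    for (unsigned int i = 0; i < NR_CPUS_MODEL; i++)
        sum += nr_unused[i];
    return sum < 0 ? 0 : sum;
}
```

An inode may be added to the unused list on one CPU and removed on another, which is why individual slots can be negative while the sum stays meaningful.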

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-07-20 13:44:34 +0800
e7f590970 kill useless checks for sb->s_op == NULL

never is...

Signed-off-by: Al Viro

Al Viro
2011-07-20 13:44:21 +0800

28 Jun, 2011

1 commit

08142579b mm: fix assertion mapping->nrpages == 0 in end_writeback()

Under heavy memory and filesystem load, users observe the assertion
mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
page reclaim reclaiming the last page from a mapping in the following
race:

CPU0                                CPU1
...
shrink_page_list()
  __remove_mapping()
    __delete_from_page_cache()
      radix_tree_delete()
                                    evict_inode()
                                      truncate_inode_pages()
                                        truncate_inode_pages_range()
                                          pagevec_lookup() - finds nothing
                                      end_writeback()
                                        mapping->nrpages != 0 -> BUG
      page->mapping = NULL
      mapping->nrpages--

Fix the problem by doing a reliable check of mapping->nrpages under
mapping->tree_lock in end_writeback().

Analyzed by Jay, lost in LKML, and dug out
by Miklos Szeredi.

Cc: Jay
Cc: Miklos Szeredi
Signed-off-by: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2011-06-28 09:00:13 +0800

08 Jun, 2011

1 commit

f758eeabe writeback: split inode_wb_list_lock into bdi_writeback.list_lock

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads. It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
------------------
inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
------------------------
&(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Wu Fengguang

Christoph Hellwig
2011-06-08 08:25:21 +0800

27 May, 2011

1 commit

4b4563dc8 fs: cosmetic inode.c cleanups

Move the lock order description after all the includes, remove several
fairly outdated and/or incorrect comments, move Andrea's
copyright/changelog to the top where it belongs, remove the pointless
filename in the top of the file comment, and remove two useless macros.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2011-05-27 21:43:00 +0800