Eric Lee / smarc-fsl-linux-kernel

07 Mar, 2010

7 commits

5beb49305 mm: change anon_vma linking to fix multi-process server scalability issue ... Browse Code »

The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Cc: Minchan Kim
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2010-03-07 03:26:26 +0800
42e496086 vfs: take f_lock on modifying f_mode after open time ... Browse Code »

We'll introduce FMODE_RANDOM which will be runtime modified. So protect
all runtime modification to f_mode with f_lock to avoid races.

Signed-off-by: Wu Fengguang
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Trond Myklebust
Cc: Chuck Lever
Cc: [2.6.33.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-03-07 03:26:25 +0800
b084d4353 mm: count swap usage ... Browse Code »

A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.

Besides we can count the number of swapents per a process by scanning
/proc//smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)

This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc//status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB
Reviewed-by: Minchan Kim
Reviewed-by: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
34e55232e mm: avoid false sharing of mm_counter ... Browse Code »

Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:24 +0800
d559db086 mm: clean up mm_counter ... Browse Code »

Presently, per-mm statistics counter is defined by macro in sched.h

This patch modifies it to
- defined in mm.h as inlinf functions
- use array instead of macro's name creation.

This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-07 03:26:23 +0800
984b3f574 bitops: rename for_each_bit() to for_each_set_bit() ... Browse Code »

Rename for_each_bit to for_each_set_bit in the kernel source tree. To
permit for_each_clear_bit(), should that ever be added.

The patch includes a macro to map the old for_each_bit() onto the new
for_each_set_bit(). This is a (very) temporary thing to ease the migration.

[akpm@linux-foundation.org: add temporary for_each_bit()]
Suggested-by: Alexey Dobriyan
Suggested-by: Andrew Morton
Signed-off-by: Akinobu Mita
Cc: "David S. Miller"
Cc: Russell King
Cc: David Woodhouse
Cc: Artem Bityutskiy
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2010-03-07 03:26:23 +0800
781b16775 Fix a dumb typo - use of & instead of && ... Browse Code »

We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774b2 ("Switch !O_CREAT case to use of do_last()")

Reported-by: Walter Sheets
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2010-03-07 02:54:48 +0800

06 Mar, 2010

23 commits

cc7889ff5 Merge branch 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 ... Browse Code »

* 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (44 commits)
NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping
NFS: Clean up nfs_sync_mapping
NFS: Simplify nfs_wb_page()
NFS: Replace __nfs_write_mapping with sync_inode()
NFS: Simplify nfs_wb_page_cancel()
NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages
NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
NFS: Reduce the number of unnecessary COMMIT calls
NFS: Add a count of the number of unstable writes carried by an inode
NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c
nfs41 fix NFS4ERR_CLID_INUSE for exchange id
NFS: Fix an allocation-under-spinlock bug
SUNRPC: Handle EINVAL error returns from the TCP connect operation
NFSv4.1: Various fixes to the sequence flag error handling
nfs4: renewd renew operations should take/put a client reference
nfs41: renewd sequence operations should take/put client reference
nfs: prevent backlogging of renewd requests
nfs: kill renewd before clearing client minor version
NFS: Make close(2) asynchronous when closing NFS O_DIRECT files
NFS: Improve NFS iostat byte count accuracy for writes
...

Linus Torvalds
2010-03-06 05:25:45 +0800
b13d3c6e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Add hardlink support to .u extension
9P2010.L handshake: .L protocol negotiation
9P2010.L handshake: Remove "dotu" variable
9P2010.L handshake: Add mount option
9P2010.L handshake: Add VFS flags
net/9p: Handle mount errors correctly.
net/9p: Remove MAX_9P_CHAN limit
net/9p: Add multi channel support.

Linus Torvalds
2010-03-06 05:25:24 +0800
e213e26ab Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 ... Browse Code »

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...

Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

Linus Torvalds
2010-03-06 05:20:53 +0800
5717144a0 fs/9p: Add hardlink support to .u extension ... Browse Code »

For regular file and directories we put the link
count in th extension field in a tagged string format.

Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Eric Van Hensbergen

Aneesh Kumar K.V
2010-03-06 05:04:42 +0800
342fee1d5 9P2010.L handshake: Remove "dotu" variable ... Browse Code »

Removes 'dotu' variable and make everything dependent
on 'proto_version' field.

Signed-off-by: Sripathi Kodi
Signed-off-by: Eric Van Hensbergen

Sripathi Kodi
2010-03-06 05:04:42 +0800
dd6102fbd 9P2010.L handshake: Add VFS flags ... Browse Code »

Add 9P2000.u and 9P2010.L protocol flags to V9FS VFS

This patch adds 9P2000.u and 9P2010.L protocol flags into V9FS VFS side code
and removes the single flag used for 'extended'.

Signed-off-by: Sripathi Kodi
Signed-off-by: Eric Van Hensbergen

Sripathi Kodi
2010-03-06 05:04:41 +0800
3fa04ecd7 Merge branch 'writeback-for-2.6.34' into nfs-for-2.6.34 Browse Code »

Trond Myklebust
2010-03-06 04:46:18 +0800
1cda707d5 NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping ... Browse Code »

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:56 +0800
5cf95214c NFS: Clean up nfs_sync_mapping ... Browse Code »

Remove the redundant call to filemap_write_and_wait().

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:56 +0800
7f2f12d96 NFS: Simplify nfs_wb_page() ... Browse Code »
44

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:55 +0800
acdc53b21 NFS: Replace __nfs_write_mapping with sync_inode() ... Browse Code »

Now that we have correct COMMIT semantics in writeback_single_inode, we can
reduce and simplify nfs_wb_all(). Also replace nfs_wb_nocommit() with a
call to filemap_write_and_wait(), which doesn't need to hold the
inode->i_mutex.

With that done, we can eliminate nfs_write_mapping() altogether.

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:55 +0800
c988950eb NFS: Simplify nfs_wb_page_cancel() ... Browse Code »

In all cases we should be able to just remove the request and call
cancel_dirty_page().

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:55 +0800
2928db1ff NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages ... Browse Code »

Since nfs_scan_list() doesn't wait for locked pages, we have a race in
which it is possible to end up with an inode that needs to send a COMMIT,
but which does not have the I_DIRTY_DATASYNC flag set.

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:54 +0800
5bad5abec NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set ... Browse Code »

Signed-off-by: Trond Myklebust
Acked-by: Peter Zijlstra
Acked-by: Wu Fengguang

Trond Myklebust
2010-03-06 04:44:54 +0800
420e3646b NFS: Reduce the number of unnecessary COMMIT calls ... Browse Code »
2

If the caller is doing a non-blocking flush, and there are still writebacks
pending on the wire, we can usually defer the COMMIT call until those
writes are done.

Also ensure that we honour the wbc->nonblocking flag.

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:54 +0800
ff778d02b NFS: Add a count of the number of unstable writes carried by an inode ... Browse Code »

In order to know when we should do opportunistic commits of the unstable
writes, when the VM is doing a background flush, we add a field to count
the number of unstable writes.

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:54 +0800
8fc795f70 NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c ... Browse Code »

The sole purpose of nfs_write_inode is to commit unstable writes, so
move it into fs/nfs/write.c, and make nfs_commit_inode static.

Signed-off-by: Trond Myklebust

Trond Myklebust
2010-03-06 04:44:53 +0800
9467c4fdd Merge branch 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 ... Browse Code »

* 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
pass writeback_control to ->write_inode
make sure data is on disk before calling ->write_inode

Linus Torvalds
2010-03-06 03:53:53 +0800
35c2e967d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Switch !O_CREAT case to use of do_last()
Get rid of symlink body copying
Finish pulling of -ESTALE handling to upper level in do_filp_open()
Turn do_link spaghetty into a normal loop
Unify exits in O_CREAT handling
Kill is_link argument of do_last()
Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open()
Leave mangled flag only for setting nd.intent.open.flag
Get rid of passing mangled flag to do_last()
Don't pass mangled open_flag to finish_open()
pull more into do_last()
bail out with ELOOP earlier in do_link loop
pull the common predecessors into do_last()
postpone __putname() until after do_last()
unroll do_last: loop in do_filp_open()
Shift releasing nd->root from do_last() to its caller
gut do_filp_open() a bit more (do_last separation)
beginning to untangle do_filp_open()

Linus Torvalds
2010-03-06 03:46:31 +0800
1f63b9c15 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits)
ext4: fix up rb_root initializations to use RB_ROOT
ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
ext4: Fix the NULL reference in double_down_write_data_sem()
ext4: Fix insertion point of extent in mext_insert_across_blocks()
ext4: consolidate in_range() definitions
ext4: cleanup to use ext4_grp_offs_to_block()
ext4: cleanup to use ext4_group_first_block_no()
ext4: Release page references acquired in ext4_da_block_invalidatepages
ext4: Fix ext4_quota_write cross block boundary behaviour
ext4: Convert BUG_ON checks to use ext4_error() instead
ext4: Use direct_IO_no_locking in ext4 dio read
ext4: use ext4_get_block_write in buffer write
ext4: mechanical rename some of the direct I/O get_block's identifiers
ext4: make "offset" consistent in ext4_check_dir_entry()
ext4: Handle non empty on-disk orphan link
ext4: explicitly remove inode from orphan list after failed direct io
ext4: fix error handling in migrate
ext4: deprecate obsoleted mount options
ext4: Fix fencepost error in chosing choosing group vs file preallocation.
jbd2: clean up an assertion in jbd2_journal_commit_transaction()
...

Linus Torvalds
2010-03-06 02:47:00 +0800
b24bc1e61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: get rid of obsolete definition in header file
Squashfs: get rid of obsolete variable in struct squashfs_sb_info
Squashfs: add decompressor entries for lzma and lzo
Squashfs: add a decompressor framework
Squashfs: factor out remaining zlib dependencies into separate wrapper file
Squashfs: move zlib decompression wrapper code into a separate file

Linus Torvalds
2010-03-06 02:46:04 +0800
a9185b41a pass writeback_control to ->write_inode ... Browse Code »

This gives the filesystem more information about the writeback that
is happening. Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2010-03-06 02:25:52 +0800
26821ed40 make sure data is on disk before calling ->write_inode ... Browse Code »

Similar to the fsync issue fixed a while ago in commit
2daea67e966dc0c42067ebea015ddac6834cef88 we need to write for data to
actually hit the disk before writing out the metadata to guarantee
data integrity for filesystems that modify the inode in the data I/O
completion path. Currently XFS and NFS handle this manually, and AFS
has a write_inode method that does nothing but waiting for data, while
others are possibly missing out on this.

Fortunately this change has a lot less impact than the fsync change
as none of the write_inode methods starts data writeout of any form
by itself.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2010-03-06 02:25:10 +0800

05 Mar, 2010

10 commits

06862f884 Squashfs: get rid of obsolete definition in header file ... Browse Code »

Signed-off-by: Phillip Lougher

Phillip Lougher
2010-03-05 23:35:35 +0800
ae4a3179b Squashfs: get rid of obsolete variable in struct squashfs_sb_info ... Browse Code »

Signed-off-by: Phillip Lougher

Phillip Lougher
2010-03-05 23:35:20 +0800
1f36f774b Switch !O_CREAT case to use of do_last() ... Browse Code »

... and now we have all intents crap well localized

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:22:25 +0800
def4af30c Get rid of symlink body copying ... Browse Code »

Now that nd->last stays around until ->put_link() is called, we can
just postpone that ->put_link() in do_filp_open() a bit and don't
bother with copying.

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:40 +0800
3866248e5 Finish pulling of -ESTALE handling to upper level in do_filp_open() ... Browse Code »

Don't bother with path_walk() (and its retry loop); link_path_walk()
will do it.

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:38 +0800
806b681cb Turn do_link spaghetty into a normal loop ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:36 +0800
10fa8e62f Unify exits in O_CREAT handling ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:35 +0800
9e67f3616 Kill is_link argument of do_last() ... Browse Code »

We set it to 1 iff we return NULL

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:33 +0800
67ee3ad21 Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open() ... Browse Code »

Note that in case of !O_CREAT we know that nd.root has already been given up

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:31 +0800
4296e2cbf Leave mangled flag only for setting nd.intent.open.flag ... Browse Code »

Nothing else uses it anymore

Signed-off-by: Al Viro

Al Viro
2010-03-05 22:01:29 +0800