07 Mar, 2010

7 commits

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • We'll introduce FMODE_RANDOM which will be runtime modified. So protect
    all runtime modification to f_mode with f_lock to avoid races.

    Signed-off-by: Wu Fengguang
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Chuck Lever
    Cc: [2.6.33.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • A frequent questions from users about memory management is what numbers of
    swap ents are user for processes. And this information will give some
    hints to oom-killer.

    Besides we can count the number of swapents per a process by scanning
    /proc//smaps, this is very slow and not good for usual process
    information handler which works like 'ps' or 'top'. (ps or top is now
    enough slow..)

    This patch adds a counter of swapents to mm_counter and update is at each
    swap events. Information is exported via /proc//status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.

    This patch adds per-thread cache for mm_counter. RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.

    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.

    rough estimation with small benchmark on parallel thread (2threads) shows
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, per-mm statistics counter is defined by macro in sched.h

    This patch modifies it to
    - defined in mm.h as inlinf functions
    - use array instead of macro's name creation.

    This patch is for reducing patch size in future patch to modify
    implementation of per-mm counter.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Rename for_each_bit to for_each_set_bit in the kernel source tree. To
    permit for_each_clear_bit(), should that ever be added.

    The patch includes a macro to map the old for_each_bit() onto the new
    for_each_set_bit(). This is a (very) temporary thing to ease the migration.

    [akpm@linux-foundation.org: add temporary for_each_bit()]
    Suggested-by: Alexey Dobriyan
    Suggested-by: Andrew Morton
    Signed-off-by: Akinobu Mita
    Cc: "David S. Miller"
    Cc: Russell King
    Cc: David Woodhouse
    Cc: Artem Bityutskiy
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • We managed to lose O_DIRECTORY testing due to a stupid typo in commit
    1f36f774b2 ("Switch !O_CREAT case to use of do_last()")

    Reported-by: Walter Sheets
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

06 Mar, 2010

23 commits

  • * 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (44 commits)
    NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping
    NFS: Clean up nfs_sync_mapping
    NFS: Simplify nfs_wb_page()
    NFS: Replace __nfs_write_mapping with sync_inode()
    NFS: Simplify nfs_wb_page_cancel()
    NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages
    NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
    NFS: Reduce the number of unnecessary COMMIT calls
    NFS: Add a count of the number of unstable writes carried by an inode
    NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c
    nfs41 fix NFS4ERR_CLID_INUSE for exchange id
    NFS: Fix an allocation-under-spinlock bug
    SUNRPC: Handle EINVAL error returns from the TCP connect operation
    NFSv4.1: Various fixes to the sequence flag error handling
    nfs4: renewd renew operations should take/put a client reference
    nfs41: renewd sequence operations should take/put client reference
    nfs: prevent backlogging of renewd requests
    nfs: kill renewd before clearing client minor version
    NFS: Make close(2) asynchronous when closing NFS O_DIRECT files
    NFS: Improve NFS iostat byte count accuracy for writes
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    fs/9p: Add hardlink support to .u extension
    9P2010.L handshake: .L protocol negotiation
    9P2010.L handshake: Remove "dotu" variable
    9P2010.L handshake: Add mount option
    9P2010.L handshake: Add VFS flags
    net/9p: Handle mount errors correctly.
    net/9p: Remove MAX_9P_CHAN limit
    net/9p: Add multi channel support.

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
    quota: stop using QUOTA_OK / NO_QUOTA
    dquot: cleanup dquot initialize routine
    dquot: move dquot initialization responsibility into the filesystem
    dquot: cleanup dquot drop routine
    dquot: move dquot drop responsibility into the filesystem
    dquot: cleanup dquot transfer routine
    dquot: move dquot transfer responsibility into the filesystem
    dquot: cleanup inode allocation / freeing routines
    dquot: cleanup space allocation / freeing routines
    ext3: add writepage sanity checks
    ext3: Truncate allocated blocks if direct IO write fails to update i_size
    quota: Properly invalidate caches even for filesystems with blocksize < pagesize
    quota: generalize quota transfer interface
    quota: sb_quota state flags cleanup
    jbd: Delay discarding buffers in journal_unmap_buffer
    ext3: quota_write cross block boundary behaviour
    quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
    quota: split out compat_sys_quotactl support from quota.c
    quota: split out netlink notification support from quota.c
    quota: remove invalid optimization from quota_sync_all
    ...

    Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

    Linus Torvalds
     
  • For regular file and directories we put the link
    count in th extension field in a tagged string format.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Eric Van Hensbergen

    Aneesh Kumar K.V
     
  • Removes 'dotu' variable and make everything dependent
    on 'proto_version' field.

    Signed-off-by: Sripathi Kodi
    Signed-off-by: Eric Van Hensbergen

    Sripathi Kodi
     
  • Add 9P2000.u and 9P2010.L protocol flags to V9FS VFS

    This patch adds 9P2000.u and 9P2010.L protocol flags into V9FS VFS side code
    and removes the single flag used for 'extended'.

    Signed-off-by: Sripathi Kodi
    Signed-off-by: Eric Van Hensbergen

    Sripathi Kodi
     
  • Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Remove the redundant call to filemap_write_and_wait().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Now that we have correct COMMIT semantics in writeback_single_inode, we can
    reduce and simplify nfs_wb_all(). Also replace nfs_wb_nocommit() with a
    call to filemap_write_and_wait(), which doesn't need to hold the
    inode->i_mutex.

    With that done, we can eliminate nfs_write_mapping() altogether.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • In all cases we should be able to just remove the request and call
    cancel_dirty_page().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Since nfs_scan_list() doesn't wait for locked pages, we have a race in
    which it is possible to end up with an inode that needs to send a COMMIT,
    but which does not have the I_DIRTY_DATASYNC flag set.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust
    Acked-by: Peter Zijlstra
    Acked-by: Wu Fengguang

    Trond Myklebust
     
  • If the caller is doing a non-blocking flush, and there are still writebacks
    pending on the wire, we can usually defer the COMMIT call until those
    writes are done.

    Also ensure that we honour the wbc->nonblocking flag.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • In order to know when we should do opportunistic commits of the unstable
    writes, when the VM is doing a background flush, we add a field to count
    the number of unstable writes.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • The sole purpose of nfs_write_inode is to commit unstable writes, so
    move it into fs/nfs/write.c, and make nfs_commit_inode static.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • * 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    pass writeback_control to ->write_inode
    make sure data is on disk before calling ->write_inode

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Switch !O_CREAT case to use of do_last()
    Get rid of symlink body copying
    Finish pulling of -ESTALE handling to upper level in do_filp_open()
    Turn do_link spaghetty into a normal loop
    Unify exits in O_CREAT handling
    Kill is_link argument of do_last()
    Pull handling of LAST_BIND into do_last(), clean up ok: part in do_filp_open()
    Leave mangled flag only for setting nd.intent.open.flag
    Get rid of passing mangled flag to do_last()
    Don't pass mangled open_flag to finish_open()
    pull more into do_last()
    bail out with ELOOP earlier in do_link loop
    pull the common predecessors into do_last()
    postpone __putname() until after do_last()
    unroll do_last: loop in do_filp_open()
    Shift releasing nd->root from do_last() to its caller
    gut do_filp_open() a bit more (do_last separation)
    beginning to untangle do_filp_open()

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits)
    ext4: fix up rb_root initializations to use RB_ROOT
    ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
    ext4: Fix the NULL reference in double_down_write_data_sem()
    ext4: Fix insertion point of extent in mext_insert_across_blocks()
    ext4: consolidate in_range() definitions
    ext4: cleanup to use ext4_grp_offs_to_block()
    ext4: cleanup to use ext4_group_first_block_no()
    ext4: Release page references acquired in ext4_da_block_invalidatepages
    ext4: Fix ext4_quota_write cross block boundary behaviour
    ext4: Convert BUG_ON checks to use ext4_error() instead
    ext4: Use direct_IO_no_locking in ext4 dio read
    ext4: use ext4_get_block_write in buffer write
    ext4: mechanical rename some of the direct I/O get_block's identifiers
    ext4: make "offset" consistent in ext4_check_dir_entry()
    ext4: Handle non empty on-disk orphan link
    ext4: explicitly remove inode from orphan list after failed direct io
    ext4: fix error handling in migrate
    ext4: deprecate obsoleted mount options
    ext4: Fix fencepost error in chosing choosing group vs file preallocation.
    jbd2: clean up an assertion in jbd2_journal_commit_transaction()
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
    Squashfs: get rid of obsolete definition in header file
    Squashfs: get rid of obsolete variable in struct squashfs_sb_info
    Squashfs: add decompressor entries for lzma and lzo
    Squashfs: add a decompressor framework
    Squashfs: factor out remaining zlib dependencies into separate wrapper file
    Squashfs: move zlib decompression wrapper code into a separate file

    Linus Torvalds
     
  • This gives the filesystem more information about the writeback that
    is happening. Trond requested this for the NFS unstable write handling,
    and other filesystems might benefit from this too by beeing able to
    distinguish between the different callers in more detail.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Similar to the fsync issue fixed a while ago in commit
    2daea67e966dc0c42067ebea015ddac6834cef88 we need to write for data to
    actually hit the disk before writing out the metadata to guarantee
    data integrity for filesystems that modify the inode in the data I/O
    completion path. Currently XFS and NFS handle this manually, and AFS
    has a write_inode method that does nothing but waiting for data, while
    others are possibly missing out on this.

    Fortunately this change has a lot less impact than the fsync change
    as none of the write_inode methods starts data writeout of any form
    by itself.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

05 Mar, 2010

10 commits