31 May, 2010

1 commit


28 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     

27 May, 2010

1 commit

  • I/O errors can happen due to temporary failures, like multipath
    errors or losing network contact with the iSCSI server. Because
    of that, the VM will retry readpage on the page.

    However, do_generic_file_read does not clear PG_error. This
    causes the system to be unable to actually use the data in the
    page cache page, even if the subsequent readpage completes
    successfully!

    The function filemap_fault has had a ClearPageError before
    readpage forever. This patch simply adds the same to
    do_generic_file_read.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

25 May, 2010

4 commits

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But in the way, the allocator may find that
    there is no node to alloc memory.

    The reason is that cpuset rebinds the task's mempolicy, it cleans the
    nodes which the allocater can alloc pages on, for example:

    (mpol: mempolicy)
    task1 task1's mpol task2
    alloc page 1
    alloc on node0? NO 1
    1 change mems from 1 to 0
    1 rebind task1's mpol
    0-1 set new bits
    0 clear disallowed bits
    alloc on node1? NO 0
    ...
    can't alloc page
    goto oom

    This patch fixes this problem by expanding the nodes range first(set newly
    allowed bits) and shrink it lazily(clear newly disallowed bits). So we
    use a variable to tell the write-side task that read-side task is reading
    nodemask, and the write-side task clears newly disallowed nodes after
    read-side task ends the current memory allocation.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer.
    This is regression of caused by commit 9ff473b9a7 ("vmscan: evict
    streaming IO first"). Wow, It is 2 years old patch!

    Currently, tmpfs file cache is inserted active list at first. This means
    that the insertion doesn't only increase numbers of pages in anon LRU, but
    it also reduces anon scanning ratio. Therefore, vmscan will get totally
    confused. It scans almost only file LRU even though the system has plenty
    unused tmpfs pages.

    Historically, lru_cache_add_active_anon() was used for two reasons.
    1) Intend to priotize shmem page rather than regular file cache.
    2) Intend to avoid reclaim priority inversion of used once pages.

    But we've lost both motivation because (1) Now we have separate anon and
    file LRU list. then, to insert active list doesn't help such priotize.
    (2) In past, one pte access bit will cause page activation. then to
    insert inactive list with pte access bit mean higher priority than to
    insert active list. Its priority inversion may lead to uninteded lru
    chun. but it was already solved by commit 645747462 (vmscan: detect
    mapped file pages used only once). (Thanks Hannes, you are great!)

    Thus, now we can use lru_cache_add_anon() instead.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Henrique de Moraes Holschuh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is similar to what already happens in the write case. If we have a short
    read while doing O_DIRECT, instead of just returning, fallthrough and try to
    read the rest via buffered IO. BTRFS needs this because if we encounter a
    compressed or inline extent during DIO, we need to fallback on buffered. If the
    extent is compressed we need to read the entire thing into memory and
    de-compress it into the users pages. I have tested this with fsx and everything
    works great. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This is needed to enable moving pages into the page cache in fuse with
    splice(..., SPLICE_F_MOVE).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

04 Mar, 2010

1 commit

  • No one is calling this anymore as everyone has switched to
    invalidate_mapping_pages long time ago. Also update a few
    references to it in comments. nfs has two more, but I can't
    easily figure what they are actually referring to, so I left
    them as-is.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

03 Feb, 2010

1 commit

  • The cache alias problem will happen if the changes of user shared mapping
    is not flushed before copying, then user and kernel mapping may be mapped
    into two different cache line, it is impossible to guarantee the coherence
    after iov_iter_copy_from_user_atomic. So the right steps should be:

    flush_dcache_page(page);
    kmap_atomic(page);
    write to page;
    kunmap_atomic(page);
    flush_dcache_page(page);

    More precisely, we might create two new APIs flush_dcache_user_page and
    flush_dcache_kern_page to replace the two flush_dcache_page accordingly.

    Here is a snippet tested on omap2430 with VIPT cache, and I think it is
    not ARM-specific:

    int val = 0x11111111;
    fd = open("abc", O_RDWR);
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    *(addr+0) = 0x44444444;
    tmp = *(addr+0);
    *(addr+1) = 0x77777777;
    write(fd, &val, sizeof(int));
    close(fd);

    The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.

    Signed-off-by: Anfei
    Cc: Russell King
    Cc: Miklos Szeredi
    Cc: Nick Piggin
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    anfei zhou
     

28 Jan, 2010

1 commit

  • It's a simplified 'read_cache_page()' which takes a page allocation
    flag, so that different paths can control how aggressive the memory
    allocations are that populate a address space.

    In particular, the intel GPU object mapping code wants to be able to do
    a certain amount of own internal memory management by automatically
    shrinking the address space when memory starts getting tight. This
    allows it to dynamically use different memory allocation policies on a
    per-allocation basis, rather than depend on the (static) address space
    gfp policy.

    The actual new function is a one-liner, but re-organizing the helper
    functions to the point where you can do this with a single line of code
    is what most of the patch is all about.

    Tested-by: Chris Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Dec, 2009

1 commit

  • In the case of direct I/O falling back to buffered I/O we sync data
    twice currently: once at the end of generic_file_buffered_write using
    filemap_write_and_wait_range and once a little later in
    __generic_file_aio_write using do_sync_mapping_range with all flags set.

    The wait before write of the do_sync_mapping_range call does not make
    any sense, so just keep the filemap_write_and_wait_range call and move
    it to the right spot.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Dec, 2009

1 commit


04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

28 Sep, 2009

1 commit


24 Sep, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
    into mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

22 Sep, 2009

1 commit

  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Sep, 2009

1 commit

  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This done at the VM level by marking a page hwpoisoned
    and doing the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before
    killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used
    in different situations. Right now both are implemented and can
    be switched with a new sysctl vm.memory_failure_early_kill
    The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data structure
    knowledge is in rmap.c only. I put it here for now to keep
    everything together and rmap knowledge has been seeping out anyways

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     

14 Sep, 2009

6 commits

  • Remove these three functions since nobody uses them anymore.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Introduce new function for generic inode syncing (vfs_fsync_range) and use
    it from fsync() path. Introduce also new helper for syncing after a sync
    write (generic_write_sync) using the generic function.

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

    CC: Evgeniy Polyakov
    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    CC: OGAWA Hirofumi
    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Acked-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • generic_file_aio_write_nolock() is now used only by block devices and raw
    character device. Filesystems should use __generic_file_aio_write() in case
    generic_file_aio_write() doesn't suit them. So rename the function to
    blkdev_aio_write() and move it to fs/blockdev.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • generic_file_direct_write() and generic_file_buffered_write() called
    generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
    this is superfluous since generic_file_aio_write() does the syncing as well.
    Also XFS and OCFS2 which call these functions directly handle syncing
    themselves. So let's have a single place where syncing happens:
    generic_file_aio_write().

    We slightly change the behavior by syncing only the range of file to which the
    write happened for buffered writes but that should be all that is required.

    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
    comments to write helpers explaining how they should be used and export
    __generic_file_aio_write() since it will be used by some filesystems.

    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    Acked-by: Evgeniy Polyakov
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • This simple helper saves some filesystems conversion from byte offset
    to page numbers and also makes the fdata* interface more complete.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Jul, 2009

1 commit

  • In testing a backport of the write_begin/write_end AOPs, a 10% re-read
    regression was noticed when running iozone. This regression was
    introduced because the old AOPs would always do a mark_page_accessed(page)
    after the commit_write, but when the new AOPs where introduced, the only
    place this was kept was in pagecache_write_end().

    This patch does the same thing in the generic case as what is done in
    pagecache_write_end(), which is just to mark the page accessed before we
    do write_end().

    Signed-off-by: Josef Bacik
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

17 Jun, 2009

8 commits

  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that we do readahead for sequential mmap reads, here is a simple
    evaluation of the impacts, and one further optimization.

    It's an NFS-root debian desktop system, readahead size = 60 pages.
    The numbers are grabbed after a fresh boot into console.

    approach pgmajfault RA miss ratio mmap IO count avg IO size(pages)
    A 383 31.6% 383 11
    B 225 32.4% 390 11
    C 224 32.6% 307 13

    case A: mmap sync/async readahead disabled
    case B: mmap sync/async readahead enabled, with enforced full async readahead size
    case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
    or:
    A = vanilla 2.6.30-rc1
    B = A plus mmap readahead
    C = B plus this patch

    The numbers show that
    - there are good possibilities for random mmap reads to trigger readahead
    - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
    - case C can further reduce IO count by 1/4
    - readahead miss ratios are not quite affected

    The theory is
    - readahead is _good_ for clustered random reads, and can perform
    _better_ than readaround because they could be _async_.
    - async readahead size is guaranteed to be larger than readaround
    size, and they are _async_, hence will mostly behave better
    However for B
    - sync readahead size could be smaller than readaround size, hence may
    make things worse by produce more smaller IOs
    which will be fixed by this patch.

    Final conclusion:
    - mmap readahead reduced major faults by 1/3 and no obvious overheads;
    - mmap io can be further reduced by 1/4 with this patch.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Mmap read-around now shares the same code style and data structure with
    readahead code.

    This also removes do_page_cache_readahead(). Its last user, mmap
    read-around, has been changed to call ra_submit().

    The no-readahead-if-congested logic is dumped by the way. Users will be
    pretty sensitive about the slow loading of executables. So it's
    unfavorable to disabled mmap read-around on a congested queue.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Nick Piggin
    Signed-off-by: Fengguang Wu
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • We need this in one particular case and two more general ones.

    Now we do async readahead for sequential mmap reads, and do it with the
    help of PG_readahead. For normal reads, PG_readahead is the sufficient
    condition to do a sequential readahead. But unfortunately, for mmap
    reads, there is a tiny nuisance:

    [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
    [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
    [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
    [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~

    An unfavorably small readahead. The original dumb read-around size could
    be more efficient.

    That happened because ld-linux.so does a read(832) in L1 before mmap(),
    which triggers a 4-page readahead, with the second page tagged
    PG_readahead.

    L0: open("/lib/libc.so.6", O_RDONLY) = 3
    L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
    L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
    L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
    L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
    L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
    L7: close(3) = 0

    In general, the PG_readahead flag will also be hit in cases

    - sequential reads

    - clustered random reads

    A full readahead size is desirable in both cases.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Auto-detect sequential mmap reads and do readahead for them.

    The sequential mmap readahead will be triggered when
    - sync readahead: it's a major fault and (prev_offset == offset-1);
    - async readahead: minor fault on PG_readahead page with valid readahead state.

    The benefits of doing readahead instead of read-around:
    - less I/O wait thanks to async readahead
    - double real I/O size and no more cache hits

    The single stream case is improved a little.
    For 100,000 sequential mmap reads:

    user system cpu total
    (1-1) plain -mm, 128KB readaround: 3.224 2.554 48.40% 11.838
    (1-2) plain -mm, 256KB readaround: 3.170 2.392 46.20% 11.976
    (2) patched -mm, 128KB readahead: 3.117 2.448 47.33% 11.607

    The patched (2) has smallest total time, since it has no cache hit overheads
    and less I/O block time(thanks to async readahead). Here the I/O size
    makes no much difference, since there's only one single stream.

    Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
    since the half of the read-around pages will be readahead cache hits.

    This is going to make _real_ differences for _concurrent_ IO streams.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This shouldn't really change behavior all that much, but the single rather
    complex function with read-ahead inside a loop etc is broken up into more
    manageable pieces.

    The behaviour is also less subtle, with the read-ahead being done up-front
    rather than inside some subtle loop and thus avoiding the now unnecessary
    extra state variables (ie "did_readaround" is gone).

    Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
    the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
    which case the original code will directly jump to no_cached_page reading.

    Cc: Pavel Levshin
    Cc:
    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Impact: code simplification.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

29 May, 2009

1 commit

  • mapping->tree_lock can be acquired from interrupt context. Then,
    following dead lock can occur.

    Assume "A" as a page.

    CPU0:
    lock_page_cgroup(A)
    interrupted
    -> take mapping->tree_lock.
    CPU1:
    take mapping->tree_lock
    -> lock_page_cgroup(A)

    This patch tries to fix above deadlock by moving memcg's hook to out of
    mapping->tree_lock. charge/uncharge of pagecache/swapcache is protected
    by page lock, not tree_lock.

    After this patch, lock_page_cgroup() is not called under mapping->tree_lock.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

16 Apr, 2009

1 commit


14 Apr, 2009

1 commit

  • Fix filemap.c kernel-doc warnings:

    Warning(mm/filemap.c:575): No description found for parameter 'page'
    Warning(mm/filemap.c:575): No description found for parameter 'waiter'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

04 Apr, 2009

1 commit