17 Dec, 2009

1 commit

  • In the case of direct I/O falling back to buffered I/O, we currently
    sync data twice: once at the end of generic_file_buffered_write using
    filemap_write_and_wait_range, and once a little later in
    __generic_file_aio_write using do_sync_mapping_range with all flags set.

    The wait before write of the do_sync_mapping_range call does not make
    any sense, so just keep the filemap_write_and_wait_range call and move
    it to the right spot.
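
    A minimal sketch of the resulting fallback tail (illustrative only;
    written, written_buffered, pos and err are locals assumed from the
    surrounding __generic_file_aio_write code):

      if (written_buffered > 0) {
              loff_t end = pos + written_buffered - 1;

              /* one sync, over just the range the buffered fallback
               * wrote; this replaces the old double sync */
              err = filemap_write_and_wait_range(file->f_mapping,
                                                 pos, end);
              if (err == 0) {
                      written = written_buffered;
                      /* drop the now-clean pages so the file stays
                       * coherent for later direct I/O */
                      invalidate_mapping_pages(file->f_mapping,
                                      pos >> PAGE_CACHE_SHIFT,
                                      end >> PAGE_CACHE_SHIFT);
              }
      }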

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

24 Sep, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c into
    mm/truncate.c.
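
    A rough sketch of how a filesystem's truncate path might use the two
    helpers (examplefs_setsize is hypothetical; the three-argument
    truncate_pagecache() shown is the form introduced here, and later
    kernels dropped the old-size argument):

      static int examplefs_setsize(struct inode *inode, loff_t newsize)
      {
              loff_t oldsize = inode->i_size;
              int error;

              /* validate against s_maxbytes and RLIMIT_FSIZE */
              error = inode_newsize_ok(inode, newsize);
              if (error)
                      return error;

              i_size_write(inode, newsize);
              /* unmap and remove pagecache beyond the new size */
              truncate_pagecache(inode, oldsize, newsize);
              return 0;
      }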

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

22 Sep, 2009

1 commit

  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally, a large amount of Shmem/Tmpfs pages tends to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    However, the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.
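
    A minimal sketch of what that accounting looks like at the page cache
    add/remove hooks (assuming the patch introduces a counter named
    NR_SHMEM and keys off PageSwapBacked(); both names exist in kernels
    of this era):

      /* on adding a page to the page cache */
      if (PageSwapBacked(page))
              __inc_zone_page_state(page, NR_SHMEM);

      /* ... and the matching decrement on removal */
      if (PageSwapBacked(page))
              __dec_zone_page_state(page, NR_SHMEM);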

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Sep, 2009

1 commit

  • Add the high-level memory handler that poisons pages
    that got corrupted by hardware (typically by a two-bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This is done at the VM level by marking a page hwpoisoned
    and taking the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted, usually due to a 2-bit ECC memory or
    cache failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume the corruption, the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason, it's safe to
    just ignore it, because no corruption has been consumed yet. When
    that happens, another machine check will occur instead.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronously with respect to
    other VM users, because memory failures could happen anytime and
    anywhere, possibly violating some of their assumptions. This is why
    this code has to be extremely careful. Generally it tries to use
    normal locking rules, as in get the standard locks, even if that
    means the error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have
    non-linear algorithmic complexity, because the data structures have
    not been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare, we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used in
    different situations. Right now both are implemented and can be
    switched with a new sysctl, vm.memory_failure_early_kill.
    The default is early kill.
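
    A hypothetical userspace snippet for switching the policy (the sysctl
    name comes from the commit text; the 0/1 encoding is an assumption):

      #include <stdio.h>

      int main(void)
      {
              /* assumed encoding: 0 = late kill, 1 = early kill */
              FILE *f = fopen("/proc/sys/vm/memory_failure_early_kill", "w");

              if (!f)
                      return 1;
              fputs("0\n", f);
              return fclose(f) ? 1 : 0;
      }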

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data
    structure knowledge is in rmap.c only. I put it here for now to keep
    everything together, and rmap knowledge has been seeping out anyway.

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     

14 Sep, 2009

6 commits

  • Remove these three functions since nobody uses them anymore.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Introduce a new function for generic inode syncing (vfs_fsync_range)
    and use it from the fsync() path. Also introduce a new helper for
    syncing after a sync write (generic_write_sync) using the generic
    function.

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.
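
    A sketch of the intended pattern in a write path, per this series
    (generic_write_sync() checks O_SYNC/IS_SYNC itself and syncs only the
    written range; exact signatures changed again in later kernels):

      mutex_lock(&inode->i_mutex);
      ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
      mutex_unlock(&inode->i_mutex);

      if (ret > 0) {
              ssize_t err;

              /* a no-op unless the file is O_SYNC or the inode IS_SYNC */
              err = generic_write_sync(file, pos, ret);
              if (err < 0)
                      ret = err;
      }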

    CC: Evgeniy Polyakov
    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    CC: OGAWA Hirofumi
    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Acked-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • generic_file_aio_write_nolock() is now used only by block devices and
    the raw character device. Filesystems should use
    __generic_file_aio_write() in case generic_file_aio_write() doesn't
    suit them. So rename the function to blkdev_aio_write() and move it to
    fs/block_dev.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • generic_file_direct_write() and generic_file_buffered_write() called
    generic_osync_inode() if called on an O_SYNC file or IS_SYNC inode. But
    this is superfluous since generic_file_aio_write() does the syncing as
    well. Also, XFS and OCFS2, which call these functions directly, handle
    syncing themselves. So let's have a single place where syncing happens:
    generic_file_aio_write().

    We slightly change the behavior by syncing only the range of the file
    to which the write happened for buffered writes, but that should be all
    that is required.

    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
    comments to write helpers explaining how they should be used and export
    __generic_file_aio_write() since it will be used by some filesystems.

    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    Acked-by: Evgeniy Polyakov
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • This simple helper saves some filesystems the conversion from byte
    offsets to page numbers and also makes the fdata* interface more
    complete.
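
    The helper isn't named in this log; assuming it is
    filemap_fdatawait_range(), which fits the description and this series,
    the convenience looks like:

      /* before: callers converted byte offsets to page indexes and
       * waited on writeback themselves; now, one call on the byte range
       * with an inclusive end offset */
      err = filemap_fdatawait_range(inode->i_mapping,
                                    pos, pos + count - 1);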

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Jul, 2009

1 commit

  • In testing a backport of the write_begin/write_end AOPs, a 10% re-read
    regression was noticed when running iozone. This regression was
    introduced because the old AOPs would always do a mark_page_accessed(page)
    after the commit_write, but when the new AOPs were introduced, the only
    place this was kept was in pagecache_write_end().

    This patch does the same thing in the generic case as what is done in
    pagecache_write_end(), which is just to mark the page accessed before we
    do write_end().
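
    Roughly what the change amounts to in the generic buffered write loop
    (condensed from generic_perform_write(); error handling elided):

      status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                  &page, &fsdata);
      ...
      copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);

      /* mirror pagecache_write_end(): mark the page accessed before
       * handing it to ->write_end() */
      mark_page_accessed(page);
      status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                page, fsdata);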

    Signed-off-by: Josef Bacik
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

17 Jun, 2009

8 commits

  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
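
    Per that description, the new helper boils down to something like this
    sketch (compare alloc_pages_node(), which tolerates nid == -1):

      static inline struct page *
      alloc_pages_exact_node(int nid, gfp_t gfp_mask, unsigned int order)
      {
              /* callers promise a valid node, so only a debug check */
              VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

              return __alloc_pages(gfp_mask, order,
                                   node_zonelist(nid, gfp_mask));
      }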

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that we do readahead for sequential mmap reads, here is a simple
    evaluation of the impacts, and one further optimization.

    It's an NFS-root debian desktop system, readahead size = 60 pages.
    The numbers are grabbed after a fresh boot into console.

    approach  pgmajfault  RA miss ratio  mmap IO count  avg IO size (pages)
    A         383         31.6%          383            11
    B         225         32.4%          390            11
    C         224         32.6%          307            13

    case A: mmap sync/async readahead disabled
    case B: mmap sync/async readahead enabled, with enforced full async readahead size
    case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
    or:
    A = vanilla 2.6.30-rc1
    B = A plus mmap readahead
    C = B plus this patch

    The numbers show that
    - there are good possibilities for random mmap reads to trigger readahead
    - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
    - case C can further reduce IO count by 1/4
    - readahead miss ratios are not much affected

    The theory is
    - readahead is _good_ for clustered random reads, and can perform
      _better_ than readaround because it can be _async_;
    - async readahead size is guaranteed to be larger than readaround
      size, and being _async_, it will mostly behave better.
    However, for B:
    - sync readahead size could be smaller than readaround size, hence it
      may make things worse by producing more, smaller IOs
    which is what this patch fixes.

    Final conclusion:
    - mmap readahead reduced major faults by 1/3 and no obvious overheads;
    - mmap io can be further reduced by 1/4 with this patch.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Mmap read-around now shares the same code style and data structure with
    readahead code.

    This also removes do_page_cache_readahead(). Its last user, mmap
    read-around, has been changed to call ra_submit().

    The no-readahead-if-congested logic is dropped along the way. Users
    will be pretty sensitive about the slow loading of executables, so it's
    unfavorable to disable mmap read-around on a congested queue.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Nick Piggin
    Signed-off-by: Fengguang Wu
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • We need this in one particular case and two more general ones.

    Now we do async readahead for sequential mmap reads, and do it with
    the help of PG_readahead. For normal reads, PG_readahead is a
    sufficient condition to do a sequential readahead. But unfortunately,
    for mmap reads, there is a tiny nuisance:

    [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
    [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
    [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
    [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~

    An unfavorably small readahead. The original dumb read-around size could
    be more efficient.

    That happened because ld-linux.so does a read(832) in L1 before mmap(),
    which triggers a 4-page readahead, with the second page tagged
    PG_readahead.

    L0: open("/lib/libc.so.6", O_RDONLY) = 3
    L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
    L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
    L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
    L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
    L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
    L7: close(3) = 0

    In general, the PG_readahead flag will also be hit in these cases:

    - sequential reads

    - clustered random reads

    A full readahead size is desirable in both cases.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Auto-detect sequential mmap reads and do readahead for them.

    The sequential mmap readahead will be triggered when
    - sync readahead: it's a major fault and (prev_offset == offset-1);
    - async readahead: minor fault on PG_readahead page with valid readahead state.
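
    A condensed sketch of those two triggers in filemap_fault() (the
    readahead entry points are real kernel functions of this era; the
    sequentiality test is simplified here):

      page = find_get_page(mapping, offset);
      if (page) {
              /* minor fault: page present; if it carries the
               * PG_readahead marker, pipeline more readahead */
              if (PageReadahead(page))
                      page_cache_async_readahead(mapping, ra, file,
                                                 page, offset, 1);
      } else {
              /* major fault: read ahead synchronously when the access
               * pattern looks sequential */
              if (offset == prev_offset + 1)  /* simplified test */
                      page_cache_sync_readahead(mapping, ra, file,
                                                offset, 1);
      }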

    The benefits of doing readahead instead of read-around:
    - less I/O wait thanks to async readahead
    - double real I/O size and no more cache hits

    The single stream case is improved a little.
    For 100,000 sequential mmap reads:

                                         user    system  cpu     total
    (1-1) plain -mm, 128KB readaround:   3.224   2.554   48.40%  11.838
    (1-2) plain -mm, 256KB readaround:   3.170   2.392   46.20%  11.976
    (2)   patched -mm, 128KB readahead:  3.117   2.448   47.33%  11.607

    The patched kernel (2) has the smallest total time, since it has no
    cache hit overheads and less I/O block time (thanks to async
    readahead). Here the I/O size makes little difference, since there's
    only a single stream.

    Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is
    128KB, since half of the read-around pages will be readahead cache
    hits.

    This is going to make _real_ differences for _concurrent_ IO streams.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This shouldn't really change behavior all that much, but the single
    rather complex function with read-ahead inside a loop etc. is broken
    up into more manageable pieces.

    The behaviour is also less subtle, with the read-ahead being done
    up-front rather than inside some subtle loop, thus avoiding the now
    unnecessary extra state variables (i.e. "did_readaround" is gone).

    Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
    the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set,
    in which case the original code would jump directly to the
    no_cached_page path.

    Cc: Pavel Levshin
    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Impact: code simplification.

    Cc: Nick Piggin
    Signed-off-by: Wu Fengguang
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

29 May, 2009

1 commit

  • mapping->tree_lock can be acquired from interrupt context. Then, the
    following deadlock can occur.

    Assume "A" as a page.

    CPU0:
        lock_page_cgroup(A)
        interrupted
        -> take mapping->tree_lock
    CPU1:
        take mapping->tree_lock
        -> lock_page_cgroup(A)

    This patch fixes the above deadlock by moving memcg's hook out of
    mapping->tree_lock; charge/uncharge of pagecache/swapcache is protected
    by the page lock, not tree_lock.

    After this patch, lock_page_cgroup() is not called under mapping->tree_lock.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

14 Apr, 2009

1 commit

  • Fix filemap.c kernel-doc warnings:

    Warning(mm/filemap.c:575): No description found for parameter 'page'
    Warning(mm/filemap.c:575): No description found for parameter 'waiter'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

03 Apr, 2009

2 commits

  • Add a function to install a monitor on the page lock waitqueue for a
    particular page, thus allowing the unlocking of that page to be
    detected.

    This is used by CacheFiles to detect read completion on a page in the backing
    filesystem so that it can then copy the data to the waiting netfs page.
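
    A rough sketch of installing such a monitor (the wait_queue_t-era
    waitqueue API; monitor_func and its bookkeeping are hypothetical, and
    fs/cachefiles holds the real consumer):

      static int monitor_func(wait_queue_t *wait, unsigned mode,
                              int sync, void *key)
      {
              /* runs when the page's lock waitqueue is woken; hand the
               * now-readable page off to whoever waits for it */
              list_del_init(&wait->task_list);
              return 0;
      }

      ...
      wait_queue_t monitor;

      init_waitqueue_func_entry(&monitor, monitor_func);
      add_page_wait_queue(backing_page, &monitor);
      if (!PageLocked(backing_page))
              /* already unlocked: complete immediately (hypothetical) */
              monitor_func(&monitor, 0, 0, NULL);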

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that check for PG_private will now
    also check for that. This includes things like truncation and page
    invalidation. The function page_has_private() has been added to check
    for both PG_private and PG_private_2 at the same time.
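
    A sketch of the updated check in an invalidation-style path
    (page_has_private() and try_to_release_page() are existing kernel
    APIs; the surrounding logic is illustrative):

      if (page_has_private(page)) {
              /* PG_private and/or PG_private_2 (PG_fscache) is set:
               * let the owner (fs or cache) release its hold */
              if (!try_to_release_page(page, GFP_KERNEL))
                      return 0;   /* still pinned, skip this page */
      }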

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     

02 Mar, 2009

1 commit

  • Impact: standardize IO on cached ops

    On modern CPUs it is almost always a bad idea to use non-temporal
    stores, as the regression addressed in this commit has shown:

    30d697f: x86: fix performance regression in write() syscall

    The kernel simply has no good information about whether using non-temporal
    stores is a good idea or not - and trying to add heuristics only increases
    complexity and inserts fragility.

    The regression on cached write()s took very long to be found: over two
    years. So don't take any chances and let the hardware decide how it
    makes use of its caches.

    The only exception is drivers/gpu/drm/i915/i915_gem.c: there we are
    absolutely sure that another entity (the GPU) will pick up the dirty
    data immediately and that the CPU will not touch that data before the
    GPU does.

    Also, keep the _nocache() primitives to make it easier for people to
    experiment with these details. There may be more clear-cut cases where
    non-cached copies can be used, outside of filemap.c.

    Cc: Salman Qazi
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Feb, 2009

1 commit

  • Impact: cleanup, enable future change

    Add a 'total bytes copied' parameter to __copy_from_user_*nocache(),
    and update all the callsites.

    The parameter is not used yet - architecture code can use it to
    more intelligently decide whether the copy should be cached or
    non-temporal.
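
    Illustrative only; prototypes are per-architecture, and the 'total'
    parameter below reflects this commit's description rather than a
    definitive signature:

      /* the nocache copy primitives now also receive the total number
       * of bytes the whole operation will copy ... */
      unsigned long __copy_from_user_nocache(void *dst,
                                             const void __user *src,
                                             unsigned size,
                                             unsigned long total);

      /* ... so a caller copying one chunk of a larger write can pass
       * the full length (sketch): */
      left = __copy_from_user_nocache(kaddr + offset, buf, bytes, len);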

    Cc: Salman Qazi
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 Jan, 2009

2 commits

  • System calls with an unsigned long long argument can't be converted
    with the standard wrappers, since that would include a cast to long,
    which in turn means that we would lose the upper 32 bits on 32-bit
    architectures. Also, semctl can't use the standard wrapper since it
    has a 'union' parameter.

    So we handle them as special cases and add some extra wrappers instead.
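
    A self-contained userspace illustration of the truncation problem
    (plain C, not kernel code):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long arg = 0x100000001ULL; /* needs 33 bits */
              long narrowed = (long)arg;  /* what a generic wrapper's
                                             cast to long would do */

              /* on a 32-bit architecture this prints 1: the upper
               * 32 bits of the argument are silently lost */
              printf("%lx\n", (unsigned long)narrowed);
              return 0;
      }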

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     
  • Convert all system calls to return a long. This should be a NOP since
    all converted types should have the same size anyway, with the
    exception of sys_exit_group, which returned void. But that doesn't
    matter since that system call doesn't return.

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

09 Jan, 2009

1 commit

  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the
    gfp_mask of callers of charge to GFP_HIGHUSER_MOVABLE to show what will
    happen at memory reclaim.

    But in recent discussion, it was NACKed because it sounds ugly.

    This patch reverts it and adds some cleanup to the gfp_mask of callers
    of charge. No behavior change, but it needs review before it generates
    conflicts (HUNKs) against patches deeper in the queue.

    This patch also adds an explanation of the meaning of the gfp_mask
    passed to charge functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

4 commits

  • Frustratingly, gfp_t is really divided into two classes of flags. One
    class is the context-dependent flags (can we sleep? can we enter the
    filesystem? the block subsystem? should we use some extra reserves,
    etc.). The other is the type of memory required, which depends on how
    the algorithm is implemented rather than on the point at which the
    memory is allocated (highmem? dma memory? etc).

    Some of the functions which allocate a page and add it to page cache take
    a gfp_t, but sometimes those functions or their callers aren't really
    doing the right thing: when allocating pagecache page, the memory type
    should be mapping_gfp_mask(mapping). When allocating radix tree nodes,
    the memory type should be kernel mapped (not highmem) memory. The gfp_t
    argument should only really be needed for context dependent options.

    This patch doesn't really solve that tangle in a nice way, but it does
    attempt to fix a couple of bugs.

    - find_or_create_page changes its radix-tree allocation to only include
      the main context dependent flags, so that the pagecache page may be
      allocated from arbitrary types of memory without affecting the
      radix-tree. In practice, slab allocations don't come from highmem
      anyway, and the radix-tree only uses slab allocations. So there isn't
      a practical change (unless some fs uses GFP_DMA for pages).

    - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
      GFP_NOFS, because it is not supposed to reenter the filesystem. This
      bug could cause lock recursion if a filesystem is not expecting the
      function to reenter the fs (as per documentation).

    Filesystems should be careful about exactly what semantics they want
    and what they get when fiddling with gfp_t masks to allocate pagecache.
    One should be as liberal as possible with the type of memory that can
    be used, and the same goes for the context-specific flags.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Direct IO can invalidate and sync a lot of pagecache pages in the mapping.
    A 4K direct IO will actually try to sync and/or invalidate the pagecache
    of the entire file, for example (which might be many GB or TB large).

    Improve this by doing range syncs. Also, memory no longer has to be
    unmapped to catch the dirty bits for syncing, as dirty bits would remain
    coherent due to dirty mmap accounting.
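
    A sketch of the range-based calls this moves to (both functions are
    existing kernel APIs; pos/count describe the byte range of the direct
    IO):

      /* write back and wait on only the bytes this DIO covers */
      err = filemap_write_and_wait_range(mapping, pos,
                                         pos + count - 1);
      ...
      /* and invalidate only those pages, not the whole file */
      ret = invalidate_inode_pages2_range(mapping,
                      pos >> PAGE_CACHE_SHIFT,
                      (pos + count - 1) >> PAGE_CACHE_SHIFT);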

    This fixes the immediate DM deadlocks when doing direct IO reads to a
    block device with a mounted filesystem, if only by papering over the
    problem somewhat rather than addressing the fsync starvation cases.

    Signed-off-by: Nick Piggin
    Reviewed-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could easily break data
    integrity semantics in practice. For example, nr_to_write can be set
    to mapping->nrpages * 2; however, if a file has a single dirty page
    and fsync is called, subsequent pages might be concurrently added and
    dirtied, and write_cache_pages might then write out two of these newly
    dirtied pages while not writing out the old page that should have been
    written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.
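
    A condensed sketch of the fix inside the write_cache_pages() loop (the
    WB_SYNC_ALL/WB_SYNC_NONE distinction is the point; surrounding logic
    elided):

      if (--wbc->nr_to_write <= 0 &&
          wbc->sync_mode == WB_SYNC_NONE) {
              /* opportunistic writeback may stop early ... */
              done = 1;
              break;
      }
      /* ... but a data integrity sync (WB_SYNC_ALL) keeps going until
       * every page dirtied before the sync call has been written */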

    The old behavior is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary number of pages
    are written, whether or not we have provided the data-integrity
    semantics that the caller has asked for. Even this doesn't actually
    fix all stall cases completely: in the above situation, if the file
    has a huge number of pages in pagecache (but not dirty), then
    mapping->nrpages is going to be huge, even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger,
    and that's not a good thing, but lying about data integrity is even
    worse. We have to either perform the sync, or return -ELINUXISLAME so
    at least the caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
    unmap-time if the pte is young has a number of problems.

    mark_page_accessed is supposed to be roughly the equivalent of a young
    pte for unmapped references. Unfortunately it doesn't come with any
    context: after being called, reclaim doesn't know who touched the page
    or why.

    So calling mark_page_accessed not only adds extra lru or PG_referenced
    manipulations for pages that are already going to have pte_young ptes
    anyway, but it also adds references which are difficult to work with
    from the context of vma-specific references (e.g. MADV_SEQUENTIAL
    pte_young may not wish to contribute to the page being referenced).

    Then, simply doing SetPageReferenced when zapping a pte and finding it
    is young is not a really good solution either. SetPageReferenced does
    not correctly promote the page to the active list, for example. So
    after removing mark_page_accessed from the fault path, several
    mmap()+touch+munmap() sequences would have a very different result
    from several read(2) calls, for example, which is not really
    desirable.

    Signed-off-by: Nick Piggin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Jan, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    inotify: fix type errors in interfaces
    fix breakage in reiserfs_new_inode()
    fix the treatment of jfs special inodes
    vfs: remove duplicate code in get_fs_type()
    add a vfs_fsync helper
    sys_execve and sys_uselib do not call into fsnotify
    zero i_uid/i_gid on inode allocation
    inode->i_op is never NULL
    ntfs: don't NULL i_op
    isofs check for NULL ->i_op in root directory is dead code
    affs: do not zero ->i_op
    kill suid bit only for regular files
    vfs: lseek(fd, 0, SEEK_CUR) race condition

    Linus Torvalds