14 Jan, 2011

3 commits

  • Running the annotated branch profiler on a box doing average work
    (firefox, evolution, xchat, distcc farm), the likely() used in
    grab_cache_page_write_begin() was incorrect most of the time:

    correct  incorrect   %  Function                     File       Line
    -------  ---------   -  --------                     ----       ----
    1924262   71332401  97  grab_cache_page_write_begin  filemap.c  2206

    Adding a trace_printk() and running the function tracer limited to
    just this function I can see:

    gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
    gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
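As a rough user-space illustration of what the profiler numbers above measure, the sketch below counts how often a condition annotated with likely() actually holds. The `branch_stat` structure and the `profiled_likely()` and `incorrect_percent()` helpers are hypothetical names invented for this example; they are not the kernel's annotated branch profiler, and `__builtin_expect` assumes GCC or Clang.

```c
#include <assert.h>

/* User-space stand-in for the kernel's likely() hint (GCC/Clang builtin). */
#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical per-annotation counters, loosely modelled on the
 * "correct"/"incorrect" columns of the profiler output. */
struct branch_stat {
	unsigned long long correct;
	unsigned long long incorrect;
};

/* Evaluate a condition annotated as likely() and record whether the
 * prediction held. Returns the truth value of the condition. */
static int profiled_likely(struct branch_stat *s, int cond)
{
	assert(s);
	if (cond)
		s->correct++;
	else
		s->incorrect++;
	return likely(cond) != 0;
}

/* Percentage of evaluations where the hint was wrong, rounded down. */
static unsigned int incorrect_percent(const struct branch_stat *s)
{
	unsigned long long total = s->correct + s->incorrect;

	return total ? (unsigned int)(s->incorrect * 100 / total) : 0;
}
```

Feeding the counts from the table (1924262 correct, 71332401 incorrect) into `incorrect_percent()` gives 97, which matches the percentage column.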
     
  • Temporary IO failures, eg. due to loss of both multipath paths, can
    permanently leave the PageError bit set on a page, resulting in msync or
    fsync returning -EIO over and over again, even if IO is now getting to the
    disk correctly.

    We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in the
    filemap_fdatawait_range function. Also clearing the PageError bit on the
    page allows subsequent msync or fsync calls on this file to return without
    an error, if the subsequent IO succeeds.

    Unfortunately data written out in the msync or fsync call that returned
    -EIO can still get lost, because the page dirty bit appears to not get
    restored on IO error. However, the alternative could be potentially all
    of memory filling up with uncleanable dirty pages, hanging the system, so
    there is no nice choice here...

    Signed-off-by: Rik van Riel
    Acked-by: Valerie Aurora
    Acked-by: Jeff Layton
    Cc: Theodore Ts'o
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
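The effect described above can be sketched with a toy model in which waiting on writeback tests and clears both the per-page error bit and the per-mapping error flags. All `toy_*` and `TOY_*` names are hypothetical stand-ins for this illustration; this is not the kernel's filemap_fdatawait_range implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the mapping error flags. */
#define TOY_AS_EIO    (1u << 0)
#define TOY_AS_ENOSPC (1u << 1)

struct toy_page {
	bool error;		/* models the PageError bit */
};

struct toy_mapping {
	unsigned int flags;
	struct toy_page *pages;
	int nr_pages;
};

/* Report any recorded write error once, clearing it so that a later
 * call can succeed if the subsequent I/O went through. */
static int toy_fdatawait(struct toy_mapping *m)
{
	int ret = 0;

	assert(m);
	for (int i = 0; i < m->nr_pages; i++) {
		if (m->pages[i].error) {
			m->pages[i].error = false;	/* test-and-clear */
			ret = -EIO;
		}
	}
	if (m->flags & TOY_AS_ENOSPC) {
		m->flags &= ~TOY_AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (m->flags & TOY_AS_EIO) {
		m->flags &= ~TOY_AS_EIO;
		ret = -EIO;
	}
	return ret;
}
```

In this model the first call after a failed write returns -EIO and a second call returns 0, instead of reporting the stale error forever.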
     
  • Testing ->mapping and ->index without a ref is not stable as the page
    may have been reused at this point.

    Signed-off-by: Nick Piggin
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Jan, 2011

1 commit


02 Dec, 2010

1 commit

  • NFS needs to be able to release objects that are stored in the page
    cache once the page itself is no longer visible from the page cache.

    This patch adds a callback to the address space operations that allows
    filesystems to perform page cleanups once the page has been removed
    from the page cache.

    Original patch by: Linus Torvalds
    [trondmy: cover the cases of invalidate_inode_pages2() and
    truncate_inode_pages()]
    Signed-off-by: Trond Myklebust

    Linus Torvalds
     

12 Nov, 2010

2 commits

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we can get a deadlock:

    0. The radix tree contains two items, one has the index 0.
    1. The reader (in this case find_get_pages) takes the rcu_read_lock.
    2. The reader acquires slot(s) for item(s) including the index 0 item.
    3. The non-zero index item is deleted, and as a consequence the other item is
    moved to the root of the tree. The place where it used to be is queued for
    deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot, it remains in
    the rcu-delayed indirect node.
    4. The reader looks at the index 0 slot, and finds that the page has 0 ref
    count
    5. The reader looks at it again, hoping that the item will either be freed or
    the ref count will increase. This never happens, as the slot it is looking
    at will never be updated. Also, this slot can never be reclaimed because
    the reader is holding rcu_read_lock and is in an infinite loop.

    The fix is to re-use the same "indirect" pointer case that requires a slot
    lookup retry into a general "retry the lookup" bit.

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • 70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
    ran into a NULL dereference in here:

    int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    unsigned long from)
    {
    ----> struct inode *inode = page->mapping->host;

    It looks like page->mapping was the culprit. (xmon trace is below).
    After closer examination, I realized that do_generic_file_read() does a
    find_get_page(), and eventually locks the page before calling
    block_is_partially_uptodate(). However, it doesn't revalidate the
    page->mapping after the page is locked. So, there's a small window
    between the find_get_page() and ->is_partially_uptodate() where the page
    could get truncated and page->mapping cleared.

    We _have_ a reference, so it can't get reclaimed, but it certainly
    can be truncated.

    I think the correct thing is to check page->mapping after the
    trylock_page(), and jump out if it got truncated. This patch has been
    running in the test environment for a month or so now, and we have not
    seen this bug pop up again.

    xmon info:

    1f:mon> e
    cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
    pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
    lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
    sp: c0000002ae36f9f0
    msr: 8000000000009032
    dar: 0
    dsisr: 40000000
    current = 0xc000000378f99e30
    paca = 0xc000000000f66300
    pid = 21946, comm = bash
    1f:mon> r
    R00 = 0025c0500000006d R16 = 0000000000000000
    R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
    R02 = c000000000e8cd80 R18 = ffffffffffffffff
    R03 = c0000000031d0f88 R19 = 0000000000000001
    R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
    R05 = 0000000000000000 R21 = c0000002ae36fa68
    R06 = 0000000000000000 R22 = 0000000000000000
    R07 = 0000000000000001 R23 = c0000002ae36fbb0
    R08 = 0000000000000002 R24 = 0000000000000000
    R09 = 0000000000000000 R25 = c000000362cd3a80
    R10 = 0000000000000000 R26 = 0000000000000002
    R11 = c0000000001e7b60 R27 = 0000000000000000
    R12 = 0000000042000484 R28 = 0000000000000001
    R13 = c000000000f66300 R29 = c0000003bb97b9b8
    R14 = 0000000000000001 R30 = c000000000e28a08
    R15 = 000000000000ffff R31 = c0000000031d0f88
    pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
    lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
    msr = 8000000000009032 cr = 22000488
    ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
    dar = 0000000000000000 dsisr = 40000000
    1f:mon> t
    [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
    [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
    [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
    [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
    [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
    [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
    --- Exception: c00 (System Call) at 00000080a840bc54
    SP (fffca15df30) is in userspace
    1f:mon> di c0000000001e7a6c
    c0000000001e7a6c e9290000 ld r9,0(r9)
    c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
    c0000000001e7a74 e9440008 ld r10,8(r4)
    c0000000001e7a78 78a80020 clrldi r8,r5,32
    c0000000001e7a7c 3c000001 lis r0,1
    c0000000001e7a80 812900a8 lwz r9,168(r9)
    c0000000001e7a84 39600001 li r11,1
    c0000000001e7a88 7c080050 subf r0,r8,r0
    c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
    c0000000001e7a90 7d6b4830 slw r11,r11,r9
    c0000000001e7a94 796b0020 clrldi r11,r11,32
    c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
    c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
    c0000000001e7aa0 7d004214 add r8,r0,r8
    c0000000001e7aa4 79080020 clrldi r8,r8,32
    c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100

    Signed-off-by: Dave Hansen
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
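The "lock, then revalidate" pattern described above can be sketched in user space as follows. The `toy_*` types and the return-code convention are assumptions made for this example only; the real code path goes through trylock_page() and ->is_partially_uptodate() on struct page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode { int blkbits; };
struct toy_mapping { struct toy_inode *host; };

struct toy_page {
	struct toy_mapping *mapping;	/* NULL once the page is truncated */
	bool locked;
};

static bool toy_trylock_page(struct toy_page *p)
{
	if (p->locked)
		return false;
	p->locked = true;
	return true;
}

static void toy_unlock_page(struct toy_page *p)
{
	p->locked = false;
}

/* Returns 1 if the page was checked, 0 if it was truncated and the
 * caller should redo the lookup, -1 if the lock could not be taken. */
static int toy_check_page(struct toy_page *page, int *blkbits_out)
{
	assert(page && blkbits_out);
	if (!toy_trylock_page(page))
		return -1;
	/* Revalidate under the lock: truncation may have cleared ->mapping
	 * between the lookup and the lock. */
	if (!page->mapping) {
		toy_unlock_page(page);
		return 0;
	}
	*blkbits_out = page->mapping->host->blkbits;	/* safe to dereference */
	toy_unlock_page(page);
	return 1;
}
```

The point of the sketch is only the ordering: the NULL check sits after the lock is taken and before any dereference of the mapping.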
     

03 Nov, 2010

1 commit


27 Oct, 2010

3 commits

  • 'end' shadows an earlier one and is not necessary at all. Remove it and use
    'pos' instead. This removes the following sparse warnings:

    mm/filemap.c:2180:24: warning: symbol 'end' shadows an earlier one
    mm/filemap.c:2132:25: originally declared here

    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • This change reduces mmap_sem hold times that are caused by waiting for
    disk transfers when accessing file mapped VMAs.

    It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
    site wants mmap_sem to be released if blocking on a pending disk transfer.
    In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
    do_page_fault() will then re-acquire mmap_sem and retry the page fault.

    It is expected that the retry will hit the same page which will now be
    cached, and thus it will complete with a low mmap_sem hold time.

    Tests:

    - microbenchmark: thread A mmaps a large file and does random read accesses
    to the mmaped area - achieves about 55 iterations/s. Thread B does
    mmap/munmap in a loop at a separate location - achieves 55 iterations/s
    before, 15000 iterations/s after.

    - We are seeing related effects in some applications in house, which show
    significant performance regressions when running without this change.

    [akpm@linux-foundation.org: fix warning & crash]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
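The retry protocol described above can be modelled with a small single-threaded sketch. The `TOY_*` flags and `toy_*` functions are hypothetical; a boolean stands in for mmap_sem, and the "disk transfer" completes instantly. Clearing the allow-retry flag on the second attempt is an assumption of this sketch, chosen so the loop cannot spin indefinitely.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ALLOW_RETRY 0x1u	/* caller permits dropping the lock */
#define TOY_FAULT_RETRY 0x2u	/* handler dropped the lock; retry */

struct toy_mm {
	bool sem_held;		/* models mmap_sem being held */
	bool page_cached;	/* models the page being in the page cache */
	int waits_under_sem;	/* blocking waits done with the lock held */
};

/* Models the fault handler: if it would block on I/O and the caller
 * allows it, release the lock first and ask for a retry. */
static unsigned int toy_filemap_fault(struct toy_mm *mm, unsigned int flags)
{
	assert(mm->sem_held);
	if (mm->page_cached)
		return 0;
	if (flags & TOY_ALLOW_RETRY) {
		mm->sem_held = false;	/* drop the lock before blocking */
		mm->page_cached = true;	/* "I/O" completes while unlocked */
		return TOY_FAULT_RETRY;
	}
	mm->waits_under_sem++;		/* old behaviour: wait with lock held */
	mm->page_cached = true;
	return 0;
}

/* Models the architecture fault path: re-acquire the lock and retry,
 * without allowing a further retry. Returns the number of attempts. */
static int toy_do_page_fault(struct toy_mm *mm)
{
	unsigned int flags = TOY_ALLOW_RETRY;
	int attempts = 0;

	for (;;) {
		mm->sem_held = true;
		attempts++;
		if (!(toy_filemap_fault(mm, flags) & TOY_FAULT_RETRY))
			break;
		flags &= ~TOY_ALLOW_RETRY;
	}
	mm->sem_held = false;
	return attempts;
}
```

On a cold page the model takes two attempts and never waits with the lock held; on a cached page it completes in one attempt.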
     
  • Introduce a single location where filemap_fault() locks the desired page.
    There used to be two such places, depending on whether the initial find_get_page()
    was successful or not.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Acked-by: Linus Torvalds
    Cc: Nick Piggin
    Reviewed-by: Wu Fengguang
    Cc: Ying Han
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

10 Aug, 2010

1 commit


31 May, 2010

1 commit


28 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
    Btrfs: add more error checking to btrfs_dirty_inode
    Btrfs: allow unaligned DIO
    Btrfs: drop verbose enospc printk
    Btrfs: Fix block generation verification race
    Btrfs: fix preallocation and nodatacow checks in O_DIRECT
    Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
    Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
    Btrfs: rework O_DIRECT enospc handling
    Btrfs: use async helpers for DIO write checksumming
    Btrfs: don't walk around with task->state != TASK_RUNNING
    Btrfs: do aio_write instead of write
    Btrfs: add basic DIO read/write support
    direct-io: do not merge logically non-contiguous requests
    direct-io: add a hook for the fs to provide its own submit_bio function
    fs: allow short direct-io reads to be completed via buffered IO
    Btrfs: Metadata ENOSPC handling for balance
    Btrfs: Pre-allocate space for data relocation
    Btrfs: Metadata ENOSPC handling for tree log
    Btrfs: Metadata reservation for orphan inodes
    Btrfs: Introduce global metadata reservation
    ...

    Linus Torvalds
     

27 May, 2010

1 commit

  • I/O errors can happen due to temporary failures, like multipath
    errors or losing network contact with the iSCSI server. Because
    of that, the VM will retry readpage on the page.

    However, do_generic_file_read does not clear PG_error. This
    causes the system to be unable to actually use the data in the
    page cache page, even if the subsequent readpage completes
    successfully!

    The function filemap_fault has had a ClearPageError before
    readpage forever. This patch simply adds the same to
    do_generic_file_read.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Rik van Riel
    Acked-by: Larry Woodman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

25 May, 2010

4 commits

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node on which to allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it clears the
    nodes on which the allocator can allocate pages, for example:

    (mpol: mempolicy)
    task1                   task1's mpol    task2
    alloc page              1
      alloc on node0? NO    1
                            1               change mems from 1 to 0
                            1               rebind task1's mpol
                            0-1               set new bits
                            0                 clear disallowed bits
      alloc on node1? NO    0
      ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the nodes range first (setting
    newly allowed bits) and shrinking it lazily (clearing newly disallowed
    bits). So we use a variable to tell the write-side task that the read-side
    task is reading the nodemask, and the write-side task clears newly
    disallowed nodes after the read-side task ends the current memory allocation.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
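The "expand first, shrink lazily" idea can be replayed with a single-threaded toy model. The `toy_task` structure and its helpers are hypothetical and heavily simplified: here the deferred shrink is applied when the reader finishes its allocation, whereas the patch describes the write-side task waiting for the reader.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_task {
	unsigned long mems_allowed;	/* bit n set: node n is allowed */
	unsigned long pending_mask;	/* target mask of a deferred shrink */
	bool shrink_pending;
	bool in_alloc;			/* reader is walking the nodemask */
};

/* Apply the deferred shrink, but only when no allocation is in flight. */
static void toy_finish_change(struct toy_task *t)
{
	if (t->shrink_pending && !t->in_alloc) {
		t->mems_allowed &= t->pending_mask;
		t->shrink_pending = false;
	}
}

/* Writer side: set newly allowed bits now, clear disallowed bits later. */
static void toy_change_mems(struct toy_task *t, unsigned long new_mask)
{
	assert(new_mask != 0);
	t->mems_allowed |= new_mask;	/* expand first */
	t->pending_mask = new_mask;
	t->shrink_pending = true;
	toy_finish_change(t);		/* shrinks at once if no reader */
}

static void toy_alloc_begin(struct toy_task *t)
{
	t->in_alloc = true;
}

static void toy_alloc_end(struct toy_task *t)
{
	t->in_alloc = false;
	toy_finish_change(t);		/* lazy shrink happens here */
}

static bool toy_node_allowed(const struct toy_task *t, int node)
{
	return (t->mems_allowed >> node) & 1UL;
}
```

Replaying the table above: the reader finds node 0 disallowed, the writer changes the mask from node 1 to node 0, and the reader still finds node 1 allowed because the shrink is deferred until the allocation ends.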
     
  • Shaohua Li reported that parallel file copy on tmpfs can trigger the OOM
    killer. This is a regression caused by commit 9ff473b9a7 ("vmscan: evict
    streaming IO first"). Wow, it is a 2-year-old patch!

    Currently, tmpfs file cache is inserted into the active list at first. This
    means that the insertion doesn't only increase the number of pages in the
    anon LRU, but it also reduces the anon scanning ratio. Therefore, vmscan
    will get totally confused. It scans almost only the file LRU even though
    the system has plenty of unused tmpfs pages.

    Historically, lru_cache_add_active_anon() was used for two reasons.
    1) To prioritize shmem pages over regular file cache.
    2) To avoid reclaim priority inversion of used-once pages.

    But we've lost both motivations because (1) we now have separate anon and
    file LRU lists, so inserting into the active list doesn't help such
    prioritization; (2) in the past, one pte access bit would cause page
    activation, so inserting into the inactive list with the pte access bit
    set meant higher priority than inserting into the active list. This
    priority inversion could lead to unintended LRU churn, but it was already
    solved by commit 645747462 (vmscan: detect mapped file pages used only
    once). (Thanks Hannes, you are great!)

    Thus, now we can use lru_cache_add_anon() instead.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Henrique de Moraes Holschuh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is similar to what already happens in the write case. If we have a short
    read while doing O_DIRECT, instead of just returning, fall through and try to
    read the rest via buffered IO. BTRFS needs this because if we encounter a
    compressed or inline extent during DIO, we need to fall back on buffered IO. If
    the extent is compressed we need to read the entire thing into memory and
    decompress it into the user's pages. I have tested this with fsx and everything
    works great. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
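A minimal sketch of the short-read fallback, assuming a toy file whose direct path can only serve bytes below a fixed offset (standing in for the point where a compressed or inline extent begins). The `toy_file` structure and `toy_*` functions are hypothetical and do not reflect the kernel's read path in detail.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

struct toy_file {
	const char *data;
	size_t size;
	size_t dio_limit;	/* direct reads cannot go past this offset */
};

/* Direct path: may return fewer bytes than requested. */
static ssize_t toy_direct_read(const struct toy_file *f, char *buf,
			       size_t count, size_t pos)
{
	size_t end = pos + count;

	if (end > f->size)
		end = f->size;
	if (end > f->dio_limit)
		end = f->dio_limit;
	if (pos >= end)
		return 0;
	memcpy(buf, f->data + pos, end - pos);
	return (ssize_t)(end - pos);
}

/* Buffered path: serves anything up to end of file. */
static ssize_t toy_buffered_read(const struct toy_file *f, char *buf,
				 size_t count, size_t pos)
{
	if (pos >= f->size)
		return 0;
	if (count > f->size - pos)
		count = f->size - pos;
	memcpy(buf, f->data + pos, count);
	return (ssize_t)count;
}

/* On a short direct read, fall through and read the rest buffered
 * instead of returning the partial result. */
static ssize_t toy_read(const struct toy_file *f, char *buf,
			size_t count, size_t pos)
{
	ssize_t done = toy_direct_read(f, buf, count, pos);

	assert(f && buf);
	if (done < 0 || (size_t)done == count || pos + (size_t)done >= f->size)
		return done;
	ssize_t rest = toy_buffered_read(f, buf + done, count - (size_t)done,
					 pos + (size_t)done);
	if (rest < 0)
		return done ? done : rest;
	return done + rest;
}
```

With a 10-byte file and `dio_limit` of 4, an 8-byte read from offset 0 gets 4 bytes from the direct path and the remaining 4 from the buffered path.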
     
  • This is needed to enable moving pages into the page cache in fuse with
    splice(..., SPLICE_F_MOVE).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
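The hazard described above is that a limit read twice may yield two different values. A minimal sketch of the "read once, then use the snapshot" pattern follows; the ACCESS_ONCE definition mirrors the well-known kernel macro but uses `__typeof__` for user-space GCC/Clang, and `toy_rlimit`, `toy_rlimit_cur()` and `toy_write_checks()` are hypothetical names for this example.

```c
#include <assert.h>
#include <errno.h>

/* Force exactly one load of x by going through a volatile access. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct toy_rlimit {
	unsigned long rlim_cur;
	unsigned long rlim_max;
};

/* Hypothetical helper: fetch the soft limit exactly once. */
static unsigned long toy_rlimit_cur(struct toy_rlimit *lim)
{
	return ACCESS_ONCE(lim->rlim_cur);
}

/* Clamp a write against the file-size limit using a single snapshot,
 * so the comparison and the subtraction see the same value. */
static int toy_write_checks(struct toy_rlimit *lim, unsigned long pos,
			    unsigned long *count)
{
	unsigned long limit = toy_rlimit_cur(lim);

	assert(count);
	if (pos >= limit)
		return -EFBIG;
	if (*count > limit - pos)
		*count = limit - pos;
	return 0;
}
```

Without the snapshot, a concurrent update between the comparison and the subtraction could make `limit - pos` underflow.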
     

04 Mar, 2010

1 commit

  • No one is calling this anymore as everyone has switched to
    invalidate_mapping_pages a long time ago. Also update a few
    references to it in comments. nfs has two more, but I can't
    easily figure what they are actually referring to, so I left
    them as-is.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

03 Feb, 2010

1 commit

  • The cache alias problem happens if changes to the user shared mapping
    are not flushed before copying: the user and kernel mappings may then be
    mapped into two different cache lines, and it is impossible to guarantee
    coherence after iov_iter_copy_from_user_atomic. So the right steps should be:

    flush_dcache_page(page);
    kmap_atomic(page);
    write to page;
    kunmap_atomic(page);
    flush_dcache_page(page);

    More precisely, we might create two new APIs flush_dcache_user_page and
    flush_dcache_kern_page to replace the two flush_dcache_page accordingly.

    Here is a snippet tested on omap2430 with VIPT cache, and I think it is
    not ARM-specific:

    int val = 0x11111111;
    fd = open("abc", O_RDWR);
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    *(addr+0) = 0x44444444;
    tmp = *(addr+0);
    *(addr+1) = 0x77777777;
    write(fd, &val, sizeof(int));
    close(fd);

    The results are not always 0x11111111 0x77777777 at the beginning as
    expected. Sometimes we see 0x44444444 0x77777777.

    Signed-off-by: Anfei
    Cc: Russell King
    Cc: Miklos Szeredi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    anfei zhou
     

28 Jan, 2010

1 commit

  • It's a simplified 'read_cache_page()' which takes a page allocation
    flag, so that different paths can control how aggressive the memory
    allocations are that populate an address space.

    In particular, the intel GPU object mapping code wants to be able to do
    a certain amount of its own internal memory management by automatically
    shrinking the address space when memory starts getting tight. This
    allows it to dynamically use different memory allocation policies on a
    per-allocation basis, rather than depend on the (static) address space
    gfp policy.

    The actual new function is a one-liner, but re-organizing the helper
    functions to the point where you can do this with a single line of code
    is what most of the patch is all about.

    Tested-by: Chris Wilson
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Dec, 2009

1 commit

  • In the case of direct I/O falling back to buffered I/O we sync data
    twice currently: once at the end of generic_file_buffered_write using
    filemap_write_and_wait_range and once a little later in
    __generic_file_aio_write using do_sync_mapping_range with all flags set.

    The wait before write of the do_sync_mapping_range call does not make
    any sense, so just keep the filemap_write_and_wait_range call and move
    it to the right spot.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Dec, 2009

1 commit


04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping",
    "access", "default", "reasonable", "[con]currently", "temperature",
    "channel", "[un]used", "application", "example", "hierarchy", "therefore",
    "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

28 Sep, 2009

1 commit


24 Sep, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
    into mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

22 Sep, 2009

1 commit

  • Recently we encountered OOM problems due to memory use of the GEM cache.
    Generally a large amount of Shmem/Tmpfs pages tend to create a memory
    shortage problem.

    We often use the following calculation to determine the amount of shmem
    pages:

    shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

    however the expression does not consider isolated and mlocked pages.

    This patch adds explicit accounting for pages used by shmem and tmpfs.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Wu Fengguang
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
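The estimate quoted above is plain counter arithmetic, sketched here as a helper. The function name is hypothetical and the inputs are assumed to be page counts read from the corresponding VM counters; the sketch only shows why the formula drifts when some shmem pages are not on the anon LRU lists.

```c
#include <assert.h>

/* Estimate shmem pages as anon-LRU pages that are not anonymous pages.
 * Pages that have left the active/inactive anon lists (for example
 * isolated or mlocked pages) are missed by this estimate. */
static long estimate_shmem_pages(long nr_active_anon, long nr_inactive_anon,
				 long nr_anon_pages)
{
	assert(nr_active_anon >= 0 && nr_inactive_anon >= 0 &&
	       nr_anon_pages >= 0);
	return nr_active_anon + nr_inactive_anon - nr_anon_pages;
}
```

For instance, with 1000 active and 500 inactive anon-LRU pages and 1200 anonymous pages, the estimate is 300. If 100 of those shmem pages leave the LRU lists, the estimate drops to 200 even though the shmem usage is unchanged, which is the gap explicit accounting closes.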
     

16 Sep, 2009

1 commit

  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This is done at the VM level by marking a page hwpoisoned
    and doing the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before
    killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used
    in different situations. Right now both are implemented and can
    be switched with a new sysctl vm.memory_failure_early_kill
    The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data structure
    knowledge is in rmap.c only. I put it here for now to keep
    everything together and rmap knowledge has been seeping out anyways.

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     

14 Sep, 2009

6 commits

  • Remove these three functions since nobody uses them anymore.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Introduce a new function for generic inode syncing (vfs_fsync_range) and use
    it from the fsync() path. Also introduce a new helper for syncing after a sync
    write (generic_write_sync) using the generic function.

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

    CC: Evgeniy Polyakov
    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    CC: OGAWA Hirofumi
    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Acked-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • generic_file_aio_write_nolock() is now used only by block devices and raw
    character device. Filesystems should use __generic_file_aio_write() in case
    generic_file_aio_write() doesn't suit them. So rename the function to
    blkdev_aio_write() and move it to fs/blockdev.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • generic_file_direct_write() and generic_file_buffered_write() called
    generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
    this is superfluous since generic_file_aio_write() does the syncing as well.
    Also XFS and OCFS2 which call these functions directly handle syncing
    themselves. So let's have a single place where syncing happens:
    generic_file_aio_write().

    We slightly change the behavior by syncing only the range of file to which the
    write happened for buffered writes but that should be all that is required.

    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
    comments to write helpers explaining how they should be used and export
    __generic_file_aio_write() since it will be used by some filesystems.

    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    Acked-by: Evgeniy Polyakov
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • This simple helper saves some filesystems conversion from byte offset
    to page numbers and also makes the fdata* interface more complete.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
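The conversion such a helper saves callers from doing is a shift of both byte offsets by the page shift. The sketch below assumes 4096-byte pages and an inclusive end offset; `TOY_PAGE_SHIFT`, `toy_page_range` and `toy_byte_range_to_pages()` are hypothetical names for this illustration.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4096-byte pages */

struct toy_page_range {
	unsigned long first;	/* index of the first page in the range */
	unsigned long last;	/* index of the last page (inclusive) */
};

/* Convert an inclusive byte range into an inclusive page-index range. */
static struct toy_page_range toy_byte_range_to_pages(long long start_byte,
						      long long end_byte)
{
	struct toy_page_range r;

	assert(start_byte >= 0 && end_byte >= start_byte);
	r.first = (unsigned long)(start_byte >> TOY_PAGE_SHIFT);
	r.last = (unsigned long)(end_byte >> TOY_PAGE_SHIFT);
	return r;
}
```

A range that straddles a page boundary, such as bytes 4095 to 4096, maps to pages 0 through 1, while bytes 0 to 4095 stay within page 0.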
     

07 Jul, 2009

1 commit

  • In testing a backport of the write_begin/write_end AOPs, a 10% re-read
    regression was noticed when running iozone. This regression was
    introduced because the old AOPs would always do a mark_page_accessed(page)
    after the commit_write, but when the new AOPs were introduced, the only
    place this was kept was in pagecache_write_end().

    This patch does the same thing in the generic case as what is done in
    pagecache_write_end(), which is just to mark the page accessed before we
    do write_end().

    Signed-off-by: Josef Bacik
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik