15 May, 2008

1 commit

  • filemap_fault will go into an infinite loop if ->readpage() fails
    asynchronously.

    AFAICS the bug was introduced by this commit, which removed the wait after the
    final readpage:

    commit d00806b183152af6d24f46f0c33f14162ca1262a
    Author: Nick Piggin
    Date: Thu Jul 19 01:46:57 2007 -0700

    mm: fix fault vs invalidate race for linear mappings

    Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
    the page is up-to-date before jumping back to the beginning of the function.

    I've noticed this while testing nfs exporting on fuse. The patch
    fixes it.

    Signed-off-by: Miklos Szeredi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

07 May, 2008

1 commit

  • generic_file_splice_write() duplicates remove_suid() just because it
    doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
    anyway, so this is rather pointless.

    Move locking to generic_file_splice_write() and call remove_suid() and
    __splice_from_pipe() instead.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

28 Apr, 2008

1 commit

  • Clean up messy conditional calling of test_clear_page_writeback() from both
    rotate_reclaimable_page() and end_page_writeback().

    The only user of rotate_reclaimable_page() is end_page_writeback() so this is
    OK.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

11 Mar, 2008

1 commit

  • iov_iter_advance() skips over zero-length iovecs, however it does not properly
    terminate at the end of the iovec array. Fix this by checking against
    i->count before we skip a zero-length iov.

    The bug was reproduced with a test program that continually passes randomly
    generated iovs to writev. The fix was verified with the same program, which
    also checked that the file contained the correct data after each writev.
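
The check described in this commit can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel code: struct sketch_iter and sketch_iter_advance() are hypothetical names standing in for the iovec iterator and iov_iter_advance(), and only struct iovec from <sys/uio.h> is a real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical userspace stand-in for the kernel's iovec iterator. */
struct sketch_iter {
    const struct iovec *iov; /* current segment */
    size_t offset;           /* bytes already consumed in the current segment */
    size_t count;            /* bytes remaining across all segments */
};

/*
 * Advance the iterator by `bytes`. Zero-length segments are skipped, but
 * only while data remains overall (count != 0). Without the count test,
 * a zero-length segment at the end of the array would be skipped too and
 * the loop would go on to read a segment past the end of the array.
 */
static void sketch_iter_advance(struct sketch_iter *it, size_t bytes)
{
    assert(bytes <= it->count);

    while (bytes || (it->count && it->iov->iov_len == 0)) {
        size_t room = it->iov->iov_len - it->offset;
        size_t copy = bytes < room ? bytes : room;

        it->count -= copy;
        bytes -= copy;
        it->offset += copy;

        /* Segment exhausted (or empty): move to the next one. */
        if (it->offset == it->iov->iov_len) {
            it->iov++;
            it->offset = 0;
        }
    }
}
```

With segments of length 2, 0 and 3 and count = 5, advancing by 2 leaves the iterator on the third segment. With the zero-length segment last, the iterator stops on it rather than stepping off the end of the array.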

    Signed-off-by: Nick Piggin
    Tested-by: "Kevin Coffman"
    Cc: "Alexey Dobriyan"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

10 Mar, 2008

1 commit


14 Feb, 2008

1 commit


09 Feb, 2008

2 commits

  • do_generic_mapping_read was used by gfs2 for internal reads, but this use
    of the interface was rather suboptimal (as was the whole interface) and has
    been replaced by an internal helper now. This patch kills
    do_generic_mapping_read and surrounding damage in preparation of additional
    cleanups for the buffered read path.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Convert variables containing page indexes to pgoff_t.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Feb, 2008

5 commits

  • Need to strip __GFP_HIGHMEM flag while passing to mem_container_cache_charge().
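
Stripping a flag from an allocation mask is a single bitwise operation. The sketch below is illustrative only: the SKETCH_* names and their bit values are assumptions made for this example and are not taken from the kernel headers, and sketch_charge_mask() is a hypothetical helper showing the mask that would be handed to the charge routine.

```c
#include <assert.h>

typedef unsigned int sketch_gfp_t;

/* Illustrative bit assignments; assumptions, not the kernel's definitions. */
#define SKETCH_GFP_HIGHMEM 0x02u
#define SKETCH_GFP_WAIT    0x10u
#define SKETCH_GFP_IO      0x40u

/* Return the caller's mask with the highmem bit cleared, all else intact. */
static sketch_gfp_t sketch_charge_mask(sketch_gfp_t gfp_mask)
{
    sketch_gfp_t mask = gfp_mask & ~SKETCH_GFP_HIGHMEM;

    assert(!(mask & SKETCH_GFP_HIGHMEM));
    return mask;
}
```

Only the highmem bit is cleared, so the remaining bits that govern how the allocation may behave reach the charge routine unchanged.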

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • Move mem_controller_cache_charge() above radix_tree_preload().
    radix_tree_preload() disables preemption, so even though the gfp_mask passed
    contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations, and we
    hit a BUG_ON() in kmem_cache_alloc().

    This patch moves mem_controller_cache_charge() to above radix_tree_preload()
    for cache charging.

    Signed-off-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over its limit, in which case a suitable error is returned.

    This patch was tested on a PowerPC box. I am still looking for a way
    to test the path through which allocations happen in non GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Choose if we want cached pages to be accounted or not. By default both are
    accounted for. A new set of tunables is added.

    echo -n 1 > mem_control_type

    switches the accounting to account for only mapped pages

    echo -n 3 > mem_control_type

    switches the behaviour back

    [bunk@kernel.org: mm/memcontrol.c: cleanups]
    [akpm@linux-foundation.org: fix sparc32 build]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer, this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.
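
Using the last bit of a pointer as a lock works because a suitably aligned pointer always has that bit clear. The following is a minimal userspace sketch of the idea using C11 atomics; struct sketch_page and the sketch_slot_* helpers are hypothetical names for illustration and do not reproduce the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_LOCK_BIT ((uintptr_t)1)

/* Hypothetical page descriptor: one word holds both pointer and lock bit. */
struct sketch_page {
    _Atomic uintptr_t cgroup_slot;
};

/* Spin until this caller is the one that sets the lock bit. */
static void sketch_slot_lock(struct sketch_page *p)
{
    while (atomic_fetch_or(&p->cgroup_slot, SKETCH_LOCK_BIT) & SKETCH_LOCK_BIT)
        ; /* another holder has the bit set; retry */
}

/* Clear the lock bit, leaving the pointer bits untouched. */
static void sketch_slot_unlock(struct sketch_page *p)
{
    atomic_fetch_and(&p->cgroup_slot, ~SKETCH_LOCK_BIT);
}

/* Read the pointer with the lock bit masked off. */
static void *sketch_slot_get(struct sketch_page *p)
{
    return (void *)(atomic_load(&p->cgroup_slot) & ~SKETCH_LOCK_BIT);
}

/* Store a new pointer; the caller must hold the lock, which stays held. */
static void sketch_slot_assign(struct sketch_page *p, void *pc)
{
    /* The scheme relies on the low bit of the pointer being free. */
    assert(((uintptr_t)pc & SKETCH_LOCK_BIT) == 0);
    atomic_store(&p->cgroup_slot, (uintptr_t)pc | SKETCH_LOCK_BIT);
}
```

Keeping the lock in the same word as the pointer means that taking the lock and observing the current pointer value happen in one atomic operation, which is what makes concurrent updates to the slot easier to reason about.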

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

06 Feb, 2008

2 commits

  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Most pagecache (and some other) radix tree insertions have the great
    opportunity to preallocate a few nodes with relaxed gfp flags. But the
    preallocation is squandered: when it comes time to allocate a node, we
    default to first attempting a GFP_ATOMIC allocation -- that doesn't
    normally fail, but it can eat into atomic memory reserves that we don't
    need to be using.

    Another upshot of this is that it removes the sometimes highly contended
    zone->lock from underneath tree_lock. Pagecache insertions are always
    performed with a radix tree preload, and after this change, such a
    situation will never fall back to kmem_cache_alloc within
    radix_tree_node_alloc.

    David Miller reports seeing this allocation fail on a highly threaded
    sparc64 system:

    [527319.459981] dd: page allocation failure. order:0, mode:0x20
    [527319.460403] Call Trace:
    [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
    [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
    [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
    [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
    [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
    [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
    [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
    [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
    [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
    [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
    [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
    [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
    [527319.461361] [00000000004bb920] sys_read+0x34/0x60
    [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

    The calltrace is significant: __do_page_cache_readahead allocates a number
    of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
    memory to satisfy GFP_ATOMIC allocations. However after the list of pages
    goes to mpage_readpages, there can be significant intervals (including disk
    IO) before all the pages are inserted into the radix-tree. So the reserves
    can easily be depleted at that point. The patch is confirmed to fix the
    problem.

    Signed-off-by: Nick Piggin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

03 Feb, 2008

1 commit

  • Frederik Himpe reported an unkillable and un-straceable pan process.

    Zero length iovecs can go into an infinite loop in writev, because the
    iovec iterator does not always advance over them.

    The sequence required to trigger this is not trivial. I think it
    requires that a zero-length iovec be followed by a non-zero-length iovec
    which causes a pagefault in the atomic usercopy. This causes the writev
    code to drop back into single-segment copy mode, which then tries to
    copy the 0 bytes of the zero-length iovec; a zero length copy looks like
    a failure though, so it loops.

    Put a test into iov_iter_advance to catch zero-length iovecs. We could
    just put the test in the fallback path, but I feel it is more robust to
    skip over zero-length iovecs throughout the code (iovec iterator may be
    used in filesystems too, so it should be robust).
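
For readers unfamiliar with the userspace side, the sketch below shows what a vector containing a zero-length segment looks like when passed to writev(). It does not reproduce the bug, which also needed a pagefault during the atomic usercopy; it only illustrates the input shape. The sketch_* helper names are hypothetical, and a pipe is used here purely as a convenient file descriptor.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: gather-write "ab", an empty segment, then "cd". */
static ssize_t sketch_writev_with_empty_segment(int fd)
{
    static char first[] = "ab";
    static char last[] = "cd";
    struct iovec vec[3];

    assert(fd >= 0);

    vec[0].iov_base = first;
    vec[0].iov_len = 2;
    vec[1].iov_base = first; /* valid pointer, but zero bytes to copy */
    vec[1].iov_len = 0;
    vec[2].iov_base = last;
    vec[2].iov_len = 2;

    return writev(fd, vec, 3);
}

/* Returns 0 if the four payload bytes arrive intact through a pipe. */
static int sketch_roundtrip(void)
{
    int fds[2];
    char buf[4];
    int ok;

    if (pipe(fds) != 0)
        return -1;

    ok = sketch_writev_with_empty_segment(fds[1]) == 4 &&
         read(fds[0], buf, sizeof(buf)) == 4 &&
         memcmp(buf, "abcd", 4) == 0;

    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

The empty middle segment contributes nothing to the byte count, so a correct implementation has to step over it rather than treat the zero-byte copy as a failure.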

    Signed-off-by: Nick Piggin
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Feb, 2008

1 commit

  • * 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
    Remove commented-out code copied from NFS
    NFS: Switch from intr mount option to TASK_KILLABLE
    Add wait_for_completion_killable
    Add wait_event_killable
    Add schedule_timeout_killable
    Use mutex_lock_killable in vfs_readdir
    Add mutex_lock_killable
    Use lock_page_killable
    Add lock_page_killable
    Add fatal_signal_pending
    Add TASK_WAKEKILL
    exit: Use task_is_*
    signal: Use task_is_*
    sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
    ptrace: Use task_is_*
    power: Use task_is_*
    wait: Use TASK_NORMAL
    proc/base.c: Use task_is_*
    proc/array.c: Use TASK_REPORT
    perfmon: Use task_is_*
    ...

    Fixed up conflicts in NFS/sunrpc manually..

    Linus Torvalds
     

20 Dec, 2007

1 commit

  • Krzysztof Oledzki noticed a dirty page accounting leak on some of his
    machines, causing the machine to eventually lock up when the kernel
    decided that there was too much dirty data, but nobody could actually
    write anything out to fix it.

    The culprit turns out to be filesystems (cough ext3 with data=journal
    cough) that re-dirty the page when the "->invalidatepage()" callback is
    called.

    Fix it up by doing a final dirty page accounting check when we actually
    remove the page from the page cache.

    This fixes bugzilla entry 9182:

    http://bugzilla.kernel.org/show_bug.cgi?id=9182

    Tested-by: Ingo Molnar
    Tested-by: Krzysztof Oledzki
    Cc: Andrew Morton
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Dec, 2007

2 commits


01 Nov, 2007

1 commit

  • The kernel has for random historical reasons allowed ptrace() accesses
    to access (and insert) pages into the page cache above the size of the
    file.

    However, Nick broke that by mistake when doing the new fault handling in
    commit 54cb8821de07f2ffcd28c380ce9b93d5784b40d7 ("mm: merge populate and
    nopage into fault (fixes nonlinear)"). The breakage caused a hang with
    gdb when trying to access the invalid page.

    The ptrace "feature" really isn't worth resurrecting, since it really is
    wrong both from a portability _and_ from an internal page cache validity
    standpoint. So this removes those old broken remnants, and fixes the
    ptrace() hang in the process.

    Noticed and bisected by Duane Griffin, who also supplied a test-case
    (quoth Nick: "Well that's probably the best bug report I've ever had,
    thanks Duane!").

    Cc: Duane Griffin
    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Oct, 2007

1 commit

  • Commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate
    clean pages before dio write") introduced a bug which stopped dio from
    ever invalidating the page cache after writes. It still invalidated it
    before writes so most users were fine.

    Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting
    this bug when he had a buffered reader immediately reading file data
    after an O_DIRECT writer had written the data. The kernel issued
    read-ahead beyond the position of the reader which overlapped with the
    O_DIRECT writer. The failure to invalidate after writes caused the
    reader to see stale data from the read-ahead.

    The following patch is originally from Karl. The following commentary
    is his:

    The below 3rd try takes on your suggestion of just invalidating
    no matter what the retval from the direct_IO call. I ran it
    thru the test-case several times and it has worked every time.
    The post-invalidate is probably still too early for async-directio,
    but I don't have a testcase for that; just sync. And, this
    won't be any worse in the async case.

    I added a test to the aio-dio-regress repository which mimics Karl's IO
    pattern. It verified the bad behaviour and that the patch fixed it. I
    agree with Karl, this still doesn't help the case where a buffered
    reader follows an AIO O_DIRECT writer. That will require a bit more
    work.

    This gives up on the idea of returning EIO to indicate to userspace that
    stale data remains if the invalidation failed.

    Signed-off-by: Zach Brown
    Cc: Karl Schendel
    Cc: Benjamin LaHaise
    Cc: Andrew Morton
    Cc: Nick Piggin
    Cc: Leonid Ananiev
    Cc: Chris Mason
    Signed-off-by: Linus Torvalds

    Zach Brown
     

29 Oct, 2007

1 commit

  • mm/filemap.c: In function '__filemap_fdatawrite_range':
    mm/filemap.c:200: error: implicit declaration of function
    'mapping_cap_writeback_dirty'

    This happens when we don't use/have any block devices and a NFS root
    filesystem is used.

    mapping_cap_writeback_dirty() is defined in linux/backing-dev.h which
    used to be provided in mm/filemap.c by linux/blkdev.h until commit
    f5ff8422bbdd59f8c1f699df248e1b7a11073027 (Fix warnings with
    !CONFIG_BLOCK).

    Signed-off-by: Emil Medve
    Signed-off-by: Jens Axboe

    Emil Medve
     

20 Oct, 2007

1 commit

  • Fix kernel-api docbook contents problems.

    docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
    Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: of list entry
    Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
    Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
    Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
    Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'

    Signed-off-by: Randy Dunlap
    Cc: Jens Axboe
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

19 Oct, 2007

1 commit

  • It gets it indirectly from blkdev.h when CONFIG_BLOCK is enabled, but it
    needs it unconditionally for the definition of mapping_cap_writeback_dirty.

    Noticed and bisected down to 4af3c9cc4fad54c3627e9afebf905aafde5690ed
    ("Drop some headers from mm.h") by Avuton Olrich.

    Cc: Avuton Olrich
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Oct, 2007

14 commits

  • Implement file posix capabilities. This allows programs to be given a
    subset of root's powers regardless of who runs them, without having to use
    setuid and giving the binary all of root's powers.

    This version works with Kaigai Kohei's userspace tools, found at
    http://www.kaigai.gr.jp/index.php. For more information on how to use this
    patch, Chris Friedhoff has posted a nice page at
    http://www.friedhoff.org/fscaps.html.

    Changelog:
    Nov 27:
    Incorporate fixes from Andrew Morton
    (security-introduce-file-caps-tweaks and
    security-introduce-file-caps-warning-fix)
    Fix Kconfig dependency.
    Fix change signaling behavior when file caps are not compiled in.

    Nov 13:
    Integrate comments from Alexey: Remove CONFIG_ ifdef from
    capability.h, and use %zd for printing a size_t.

    Nov 13:
    Fix endianness warnings by sparse as suggested by Alexey
    Dobriyan.

    Nov 09:
    Address warnings of unused variables at cap_bprm_set_security
    when file capabilities are disabled, and simultaneously clean
    up the code a little, by pulling the new code into a helper
    function.

    Nov 08:
    For pointers to required userspace tools and how to use
    them, see http://www.friedhoff.org/fscaps.html.

    Nov 07:
    Fix the calculation of the highest bit checked in
    check_cap_sanity().

    Nov 07:
    Allow file caps to be enabled without CONFIG_SECURITY, since
    capabilities are the default.
    Hook cap_task_setscheduler when !CONFIG_SECURITY.
    Move capable(TASK_KILL) to end of cap_task_kill to reduce
    audit messages.

    Nov 05:
    Add secondary calls in selinux/hooks.c to task_setioprio and
    task_setscheduler so that selinux and capabilities with file
    cap support can be stacked.

    Sep 05:
    As Seth Arnold points out, uid checks are out of place
    for capability code.

    Sep 01:
    Define task_setscheduler, task_setioprio, cap_task_kill, and
    task_setnice to make sure a user cannot affect a process in which
    they called a program with some fscaps.

    One remaining question is the note under task_setscheduler: are we
    ok with CAP_SYS_NICE being sufficient to confine a process to a
    cpuset?

    It is a semantic change, as without fscaps, attach_task doesn't
    allow CAP_SYS_NICE to override the uid equivalence check. But since
    it uses security_task_setscheduler, which elsewhere is used where
    CAP_SYS_NICE can be used to override the uid equivalence check,
    fixing it might be tough.

    task_setscheduler
    note: this also controls cpuset:attach_task. Are we ok with
    CAP_SYS_NICE being used to confine to a cpuset?
    task_setioprio
    task_setnice
    sys_setpriority uses this (through set_one_prio) for another
    process. Need same checks as setrlimit

    Aug 21:
    Updated secureexec implementation to reflect the fact that
    euid and uid might be the same and nonzero, but the process
    might still have elevated caps.

    Aug 15:
    Handle endianness of xattrs.
    Enforce capability version match between kernel and disk.
    Enforce that no bits beyond the known max capability are
    set, else return -EPERM.
    With this extra processing, it may be worth reconsidering
    doing all the work at bprm_set_security rather than
    d_instantiate.

    Aug 10:
    Always call getxattr at bprm_set_security, rather than
    caching it at d_instantiate.

    [morgan@kernel.org: file-caps clean up for linux/capability.h]
    [bunk@kernel.org: unexport cap_inode_killpriv]
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morgan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • zone->lock is quite an "inner" lock and mostly constrained to page alloc as
    well, so like slab locks, it probably isn't something that is critically
    important to document here. However unlike slab locks, zone lock could be
    used more widely in future, and page_alloc.c might possibly have more
    business to do tricky things with pagecache than does slab. So... I don't
    think it hurts to document it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
    GFS2 were converted to the new aops, so we can make some simplifications
    for that.

    [michal.k.k.piotrowski@gmail.com: fix warning]
    Signed-off-by: Nick Piggin
    Cc: Michael Halcrow
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework the generic block "cont" routines to handle the new aops. Supporting
    cont_prepare_write would take quite a lot of code, so remove it
    instead (and we later convert all filesystems to use it).

    write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
    generic_cont_expand, so filesystems can avoid the old hacks they used.

    Signed-off-by: Nick Piggin
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
    path.

    This may be a pretty questionable gain in most cases, especially after the
    legacy 2copy write path is removed, but it doesn't cost much.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • These are intended to replace prepare_write and commit_write with more
    flexible alternatives that are also able to avoid the buffered write
    deadlock problems efficiently (which prepare_write is unable to do).

    [mark.fasheh@oracle.com: API design contributions, code review and fixes]
    [akpm@linux-foundation.org: various fixes]
    [dmonakhov@sw.ru: new aop block_write_begin fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Mark Fasheh
    Signed-off-by: Dmitriy Monakhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add an iterator data structure to operate over an iovec. Add usercopy
    operators needed by generic_file_buffered_write, and convert that function
    over.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Modify the core write() code so that it won't take a pagefault while holding a
    lock on the pagecache page. There are a number of different deadlocks possible
    if we try to do such a thing:

    1. generic_buffered_write
    2. lock_page
    3. prepare_write
    4. unlock_page+vmtruncate
    5. copy_from_user
    6. mmap_sem(r)
    7. handle_mm_fault
    8. lock_page (filemap_nopage)
    9. commit_write
    10. unlock_page

    a. sys_munmap / sys_mlock / others
    b. mmap_sem(w)
    c. make_pages_present
    d. get_user_pages
    e. handle_mm_fault
    f. lock_page (filemap_nopage)

    2,8 - recursive deadlock if page is same
    2,8;2,8 - ABBA deadlock if page is different
    2,6;b,f - ABBA deadlock if page is same

    The solution is as follows:
    1. If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

    1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish between
    non-present pages in a readable mapping, from lack of a readable mapping.

    2. If we find the destination page is non uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

    (also, rename maxlen to seglen, because it was confusing)

    This increases the CPU/memory copy cost by almost 50% on the affected
    workloads. That will be solved by introducing a new set of pagecache write
    aops in a subsequent patch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Hide some of the open-coded nr_segs tests into the iovec helpers. This is all
    to simplify generic_file_buffered_write, because that gets more complex in the
    next patch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Quite a bit of code is used in maintaining these "cached pages" that are
    probably pretty unlikely to get used. It would require a narrow race where
    the page is inserted concurrently while this process is allocating a page
    in order to create the spare page. Then a multi-page write into an uncached
    part of the file, to make use of it.

    Next, the buffered write path (and others) uses its own LRU pagevec when it
    should be just using the per-CPU LRU pagevec (which will cut down on both data
    and code size cacheline footprint). Also, these private LRU pagevecs are
    emptied after just a very short time, in contrast with the per-CPU pagevecs
    that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
    to add the pages to pagecache for a bulk write (in 4K chunks).

    [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
    to clashes in -mm. What put them there, and why? ]

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
    we may have failed the write operation despite prepare_write having
    instantiated blocks past i_size. Fix this, and consolidate the trimming into
    one place.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
    Makes the race much easier to hit.

    This is useful for demonstration and testing purposes, but is removed in a
    subsequent patch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rename some variables and fix some types.

    Signed-off-by: Andrew Morton
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This reverts commit 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which
    fixed the following bug:

    When prefaulting in the pages in generic_file_buffered_write(), we only
    faulted in the pages for the first segment of the iovec. If the second or
    successive segment described a mmapping of the page into which we're
    write()ing, and that page is not up-to-date, the fault handler tries to lock
    the already-locked page (to bring it up to date) and deadlocks.

    An exploit for this bug is in writev-deadlock-demo.c, in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    (These demos assume blocksize < PAGE_CACHE_SIZE).

    The problem with this fix is that it takes the kernel back to doing a single
    prepare_write()/commit_write() per iovec segment. So in the worst case we'll
    run prepare_write+commit_write 1024 times where we previously would have run
    it once. The other problem with the fix is that it does not fix all the locking problems.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton