15 Sep, 2009

1 commit

  • * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
    block: use blkdev_issue_discard in blk_ioctl_discard
    Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
    block: don't assume device has a request list backing in nr_requests store
    block: Optimal I/O limit wrapper
    cfq: choose a new next_req when a request is dispatched
    Seperate read and write statistics of in_flight requests
    aoe: end barrier bios with EOPNOTSUPP
    block: trace bio queueing trial only when it occurs
    block: enable rq CPU completion affinity by default
    cfq: fix the log message after dispatched a request
    block: use printk_once
    cciss: memory leak in cciss_init_one()
    splice: update mtime and atime on files
    block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
    cfq-iosched: get rid of must_alloc flag
    block: use interrupts disabled version of raise_softirq_irqoff()
    block: fix comment in blk-iopoll.c
    block: adjust default budget for blk-iopoll
    block: fix long lines in block/blk-iopoll.c
    block: add blk-iopoll, a NAPI like approach for block devices
    ...

    Linus Torvalds
     

14 Sep, 2009

1 commit

  • Introduce new function for generic inode syncing (vfs_fsync_range) and use
    it from fsync() path. Introduce also new helper for syncing after a sync
    write (generic_write_sync) using the generic function.

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

    CC: Evgeniy Polyakov
    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    CC: OGAWA Hirofumi
    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Acked-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

11 Sep, 2009

1 commit

  • Splice should update the modification and access times on regular
    files just like read and write. Not updating mtime will confuse
    backup tools, etc...

    This patch only adds the time updates for regular files. For pipes
    and other special files that splice touches, the need for updating the
    times is less clear. Let's discuss and fix that separately.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

19 May, 2009

1 commit

  • Unfortunately, multiple kmap() calls within a single thread can deadlock,
    so writing out multiple buffers with writev() isn't possible.

    Change the implementation so that it does a separate write() for each
    buffer. This actually simplifies the code a lot since the
    splice_from_pipe() helper can be used.

    This limitation is caused by HIGHMEM pages, and so only affects a
    subset of architectures and configurations. In the future it may be
    worth implementing default_file_splice_write() in a more efficient way
    on configs that allow it.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

14 May, 2009

1 commit

  • fs/splice.c: In function 'default_file_splice_read':
    fs/splice.c:566: warning: 'error' may be used uninitialized in this function

    which is sort-of true. The code will in fact return -ENOMEM instead of the
    kernel_readv() return value.

    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Andrew Morton
     

13 May, 2009

1 commit


11 May, 2009

3 commits

  • If f_op->splice_write() is not implemented, fall back to a plain write.
    Use vfs_writev() to write from the pipe buffers.

    This will allow splice on all filesystems and file types. This
    includes "direct_io" files in fuse which bypass the page cache.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • If f_op->splice_read() is not implemented, fall back to a plain read.
    Use vfs_readv() to read into previously allocated pages.

    This will allow splice and functions using splice, such as the loop
    device, to work on all filesystems. This includes "direct_io" files
    in fuse which bypass the page cache.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • Allow splice(2) to work when both the input and the output are pipes.

    Based on the implementation of the tee(2) syscall, but instead of
    duplicating the buffer references, move the buffers from the input pipe
    to the output pipe.

    Moving the whole buffer only succeeds if the full length of the buffer
    is spliced. Otherwise duplicate the buffer, just like tee(2), set the
    length of the output buffer and advance the offset on the input
    buffer.

    Since splice is operating on two pipes, special care needs to be taken
    with locking to prevent an ABBA deadlock. Again this is done
    similarly to the tee(2) syscall, first preparing the input and output
    pipes so there's data to consume and space for that data, and then
    doing the move operation while holding both locks.

    If other processes are doing I/O on the same pipes parallel to the
    splice, then by the time both inodes are locked there might be no
    buffers left to move, or no space to move them to. In this case retry
    the whole operation, including the preparation phase. This could lead
    to starvation, but I'm not sure if that's serious enough to worry
    about.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
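
The whole-buffer move versus partial split described in the pipe-to-pipe commit above can be illustrated with a small userspace model. This is a minimal sketch, not the code in fs/splice.c: `struct model_buf` and `model_move_buf()` are hypothetical names, and a pipe buffer is reduced to a page id plus an offset and a length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pipe buffer: a view into one page. */
struct model_buf {
    int page_id;    /* which page backs this buffer */
    size_t offset;  /* start of valid data within the page */
    size_t len;     /* number of valid bytes */
};

/*
 * Transfer up to `want` bytes from *in to *out.
 * Returns the number of bytes transferred. *consumed is set to 1 when the
 * whole input buffer was moved (the caller would drop it from the input
 * pipe), or to 0 when the buffer was split.
 */
static size_t model_move_buf(struct model_buf *in, struct model_buf *out,
                             size_t want, int *consumed)
{
    assert(in->len > 0);

    if (want >= in->len) {
        /* Full length requested: move the whole buffer. */
        *out = *in;
        in->len = 0;
        *consumed = 1;
        return out->len;
    }

    /* Partial: duplicate the reference (as tee(2) does), then trim. */
    *out = *in;
    out->len = want;     /* output sees only the spliced part */
    in->offset += want;  /* input keeps the remainder */
    in->len -= want;
    *consumed = 0;
    return want;
}
```

Splicing 40 bytes out of a 100-byte buffer leaves both pipes referring to the same page, with the output covering bytes 0..39 and the input advanced to offset 40.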
     

17 Apr, 2009

1 commit

  • splice: fix kernel-doc warnings

    Warning(fs/splice.c:617): bad line:
    Warning(fs/splice.c:722): No description found for parameter 'sd'
    Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

15 Apr, 2009

6 commits

  • There are lots of sequences like this, especially in splice code:

    if (pipe->inode)
            mutex_lock(&pipe->inode->i_mutex);
    /* do something */
    if (pipe->inode)
            mutex_unlock(&pipe->inode->i_mutex);

    so introduce helpers which do the conditional locking and unlocking.
    Also replace the inode_double_lock() call with a pipe_double_lock()
    helper to avoid spreading the use of this functionality beyond the
    pipe code.

    This patch is just a cleanup, and should cause no behavioral changes.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • Remove the now unused generic_file_splice_write_nolock() function.
    It's conceptually broken anyway, because splice may need to wait for
    pipe events so holding locks across the whole operation is wrong.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • Rearrange locking of i_mutex on destination and call to
    ocfs2_rw_lock() so locks are only held while buffers are copied with
    the pipe_to_file() actor, and not while waiting for more data on the
    pipe.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • Rearrange locking of i_mutex on destination so it's only held while
    buffers are copied with the pipe_to_file() actor, and not while
    waiting for more data on the pipe.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • splice_from_pipe() is only called from two places:

    - generic_splice_sendpage()
    - splice_write_null()

    Neither of these requires i_mutex to be taken on the destination inode.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • Split up __splice_from_pipe() into four helper functions:

    splice_from_pipe_begin()
    splice_from_pipe_next()
    splice_from_pipe_feed()
    splice_from_pipe_end()

    splice_from_pipe_next() will wait (if necessary) for more buffers to
    be added to the pipe. splice_from_pipe_feed() will feed the buffers
    to the supplied actor and return when there's no more data available
    (or if all of the requested data has been copied).

    This is necessary so that implementations can do locking around the
    non-waiting splice_from_pipe_feed().

    This patch should not cause any change in behavior.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
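
The begin/next/feed/end split described in the last commit above can be sketched as a userspace model in which the destination lock is held only around the non-waiting feed step. Everything here is an assumption made for illustration: the `model_*` names, the integer `locked` flag standing in for i_mutex, and the "writer" that makes two more buffers visible each time the reader would otherwise wait.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; none of these are the kernel's structures. */
struct model_pipe {
    size_t *bufs;   /* remaining bytes in each buffer */
    size_t nbufs;   /* buffers the simulated writer will ever produce */
    size_t avail;   /* buffers currently visible to the reader */
    size_t cur;     /* next buffer to consume */
};

struct model_desc {
    size_t total_len;    /* bytes still requested */
    size_t num_spliced;  /* bytes handed to the actor so far */
};

struct model_dest {
    int locked;      /* stands in for the destination lock */
    size_t written;  /* bytes the "actor" has copied */
};

static void model_begin(struct model_desc *sd)
{
    sd->num_spliced = 0;
}

/* May "wait" for more buffers, so the destination must not be locked. */
static int model_next(struct model_pipe *p, const struct model_desc *sd,
                      const struct model_dest *d)
{
    assert(!d->locked);
    if (sd->total_len == 0)
        return 0;
    if (p->cur == p->avail && p->avail < p->nbufs) {
        /* Simulated wait: the writer adds up to two buffers. */
        p->avail += 2;
        if (p->avail > p->nbufs)
            p->avail = p->nbufs;
    }
    return p->cur < p->avail;
}

/* Never waits: consumes what is visible, under the destination lock. */
static void model_feed(struct model_pipe *p, struct model_desc *sd,
                       struct model_dest *d)
{
    assert(d->locked);
    while (p->cur < p->avail && sd->total_len > 0) {
        size_t n = p->bufs[p->cur];
        if (n > sd->total_len)
            n = sd->total_len;
        d->written += n;          /* the actor copies n bytes */
        sd->num_spliced += n;
        sd->total_len -= n;
        p->bufs[p->cur] -= n;
        if (p->bufs[p->cur] == 0)
            p->cur++;             /* buffer fully consumed */
    }
}

static void model_end(struct model_pipe *p)
{
    (void)p;  /* the model has no waiting writers to wake */
}

static size_t model_splice_to_dest(struct model_pipe *p,
                                   struct model_dest *d, size_t len)
{
    struct model_desc sd = { len, 0 };

    model_begin(&sd);
    while (model_next(p, &sd, d)) {
        d->locked = 1;            /* lock only around the feed step */
        model_feed(p, &sd, d);
        d->locked = 0;
    }
    model_end(p);
    return sd.num_spliced;
}
```

The assertions in `model_next()` and `model_feed()` encode the point of the split: waiting happens with the destination unlocked, copying happens with it locked.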
     

07 Apr, 2009

1 commit

  • There's a possible deadlock in generic_file_splice_write(),
    splice_from_pipe() and ocfs2_file_splice_write():

    - task A calls generic_file_splice_write()
    - this calls inode_double_lock(), which locks i_mutex on both
      pipe->inode and target inode
    - ordering depends on inode pointers, can happen that pipe->inode is
      locked first
    - __splice_from_pipe() needs more data, calls pipe_wait()
    - this releases lock on pipe->inode, goes to interruptible sleep
    - task B calls generic_file_splice_write(), similarly to the first
    - this locks pipe->inode, then tries to lock inode, but that is
      already held by task A
    - task A is interrupted, it tries to lock pipe->inode, but fails, as
      it is already held by task B
    - ABBA deadlock

    Fix this by explicitly ordering locks: the outer lock must be on
    target inode and the inner lock (which is later unlocked and relocked)
    must be on pipe->inode. This is OK, pipe inodes and target inodes
    form two nonoverlapping sets, generic_file_splice_write() and friends
    are not called with a target which is a pipe.

    Signed-off-by: Miklos Szeredi
    Acked-by: Mark Fasheh
    Acked-by: Jens Axboe
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
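
The ordering rule in the fix above (target inode outside, pipe inode inside, with only the inner lock dropped while waiting) can be expressed as a small single-task lock-order checker. This is a sketch under stated assumptions: the ranks, `struct model_task`, and the `model_*` functions are hypothetical and only track which locks one task holds; they do not implement real mutual exclusion.

```c
#include <assert.h>

/* Hypothetical lock ranks: a lower rank is an outer lock, taken first. */
enum model_rank {
    MODEL_TARGET_INODE = 0,  /* outer lock */
    MODEL_PIPE_INODE = 1     /* inner lock, dropped and retaken on wait */
};

struct model_task {
    unsigned held;  /* bitmask of ranks this task currently holds */
};

/* Returns 0 on success, -1 if taking `rank` now would violate the order. */
static int model_lock(struct model_task *t, enum model_rank rank)
{
    unsigned bit = 1u << rank;

    /* Holding this rank or any higher one means we are out of order. */
    if (t->held >= bit)
        return -1;
    t->held |= bit;
    return 0;
}

static void model_unlock(struct model_task *t, enum model_rank rank)
{
    unsigned bit = 1u << rank;

    assert(t->held & bit);
    t->held &= ~bit;
}

/* Analogue of waiting on the pipe: only the inner lock is released. */
static int model_pipe_wait(struct model_task *t)
{
    model_unlock(t, MODEL_PIPE_INODE);
    /* a real implementation would sleep here */
    return model_lock(t, MODEL_PIPE_INODE);
}
```

With this rule, retaking the pipe lock after a wait is always legal because the target lock is already held as the outer lock. Taking the target lock while holding the pipe lock, which is the order that made the ABBA deadlock possible, is rejected.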
     

03 Apr, 2009

1 commit

  • Recruit a page flag to aid in cache management. The following extra flag is
    defined:

    (1) PG_fscache (PG_private_2)

    The marked page is backed by a local cache and is pinning resources in the
    cache driver.

    If PG_fscache is set, then things that checked for PG_private will now also
    check for that. This includes things like truncation and page invalidation.
    The function page_has_private() has been added to check for both
    PG_private and PG_private_2 at the same time.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Rik van Riel
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
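
The combined check described above amounts to testing two flag bits with one mask. The following is a minimal sketch of that idea; the bit positions, `struct model_page`, and the `MODEL_`/`model_` names are assumptions for illustration and do not reproduce the kernel's page flag definitions.

```c
#include <stdbool.h>

/* Hypothetical bit numbers; the real flag layout is not reproduced here. */
enum {
    MODEL_PG_private = 0,
    MODEL_PG_private_2 = 1,
    MODEL_PG_fscache = MODEL_PG_private_2  /* alias, as in the commit text */
};

/* One mask covering both "private" bits. */
#define MODEL_PAGE_FLAGS_PRIVATE \
    ((1UL << MODEL_PG_private) | (1UL << MODEL_PG_private_2))

struct model_page {
    unsigned long flags;
};

/* True if either private bit is set, checked in a single test. */
static bool model_page_has_private(const struct model_page *page)
{
    return (page->flags & MODEL_PAGE_FLAGS_PRIVATE) != 0;
}
```

Code paths such as truncation that previously tested only the first bit would call the combined helper instead, so a page marked only with the fscache bit is still treated as having private state.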
     

14 Jan, 2009

1 commit


09 Jan, 2009

1 commit

  • A big patch for changing memcg's LRU semantics.

    Now,
    - Each page_cgroup is linked to its own mem_cgroup's LRU (per zone).

    - The LRU of page_cgroup is not synchronous with the global LRU.

    - page and page_cgroup are one-to-one and statically allocated.

    - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup:
      lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And, when we handle the LRU list of page_cgroup, we do the following.

        pc = lookup_page_cgroup(page);
        lock_page_cgroup(pc); .....................(1)
        mz = page_cgroup_zoneinfo(pc);
        spin_lock(&mz->lru_lock);
        .....add to LRU
        spin_unlock(&mz->lru_lock);
        unlock_page_cgroup(pc);

    But (1) is a spin lock, and it can deadlock against zone->lru_lock, so
    trylock() is currently used at (1). Without (1), we can't trust that "mz"
    is correct.

    This is an attempt to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to be zone->lru_lock.
    Then, the above sequence will be written as

        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
        mem_cgroup_add/remove/etc_lru() {
                pc = lookup_page_cgroup(page);
                mz = page_cgroup_zoneinfo(pc);
                if (PageCgroupUsed(pc)) {
                        ....add to LRU
                }
        }
        spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler.
    (*) We're safe even if we don't take lock_page_cgroup(pc), because:
    1. pc->mem_cgroup can be modified
       - at charge.
       - at account_move().
    2. At charge,
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. At account_move(),
       the page is isolated and not on LRU.

    Pros.
    - easy to maintain.
    - memcg can make use of the laziness of pagevec.
    - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
    - the LRU status of memcg will be synchronized with the global LRU's.
    - the number of locks is reduced.
    - account_move() is simplified very much.
    Cons.
    - may increase the cost of LRU rotation.
      (no impact if memcg is not configured.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

31 Oct, 2008

1 commit

  • Nothing uses prepare_write or commit_write. Remove them from the tree
    completely.

    [akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
    Signed-off-by: Nick Piggin
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

10 Oct, 2008

1 commit

  • This is debatable, but while we're debating it, let's disallow the
    combination of splice and an O_APPEND destination.

    It's not entirely clear what the semantics of O_APPEND should be, and
    POSIX apparently expects pwrite() to ignore O_APPEND, for example. So
    we could make up any semantics we want, including the old ones.

    But Miklos convinced me that we should at least give it some thought,
    and that accepting writes at arbitrary offsets is wrong at least for
    IS_APPEND() files (which always have O_APPEND set, even if the reverse
    isn't true: you can obviously have O_APPEND set on a regular file).

    So disallow O_APPEND entirely for now. I doubt anybody cares, and this
    way we have one less gray area to worry about.

    Reported-and-argued-for-by: Miklos Szeredi
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Linus Torvalds
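
The restriction described above amounts to a flag test on the output file before any data is moved. A minimal sketch follows, assuming a hypothetical helper `model_splice_check_out()` and assuming the rejection is reported as -EINVAL; the commit message itself does not name the error code.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: decide whether a file opened with `f_flags` may be
 * used as a splice destination. Returns 0 if allowed, or a negative errno.
 */
static int model_splice_check_out(int f_flags)
{
    /* O_APPEND destinations are refused outright. */
    if (f_flags & O_APPEND)
        return -EINVAL;  /* assumed error code, see lead-in */
    return 0;
}
```

A caller that needs append semantics would have to drop O_APPEND and pass an explicit offset, or fall back to a plain write.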
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Jul, 2008

2 commits

  • All calls to remove_suid() are made with a file pointer, because
    (similarly to file_update_time) it is called when the file is written.

    Clean up callers by passing in a file instead of a dentry.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Use get_user_pages_fast in splice. This reverts some mmap_sem batching
    there, however the biggest problem with mmap_sem tends to be hold times
    blocking out other threads rather than cacheline bouncing. Further: on
    architectures that implement get_user_pages_fast without locks, mmap_sem
    can be avoided completely anyway.

    Signed-off-by: Nick Piggin
    Cc: Dave Kleikamp
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Dave Kleikamp
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Jens Axboe
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

04 Jul, 2008

1 commit

  • If a page was invalidated during splicing from file to a pipe, then
    generic_file_splice_read() could return a short or zero count.

    This manifested itself in rare I/O errors seen on nfs exported fuse
    filesystems. This is because nfsd uses splice_direct_to_actor() to read
    files, and fuse uses invalidate_inode_pages2() to invalidate stale data on
    open.

    Fix by redoing the page find/create if it was found to be truncated
    (invalidated).

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

28 May, 2008

2 commits


08 May, 2008

1 commit


07 May, 2008

1 commit

  • generic_file_splice_write() duplicates remove_suid() just because it
    doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
    anyway, so this is rather pointless.

    Move locking to generic_file_splice_write() and call remove_suid() and
    __splice_from_pipe() instead.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

29 Apr, 2008

1 commit


10 Apr, 2008

1 commit

  • There's a quirky loop in generic_file_splice_read() that could go
    on indefinitely, if the file splice returns 0 permanently (and not
    just as a temporary condition). Get rid of the loop and pass
    back -EAGAIN correctly from __generic_file_splice_read(), so we
    handle that condition properly as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

04 Apr, 2008

1 commit

  • The loop block driver is careful to mask __GFP_IO|__GFP_FS out of its
    mapping_gfp_mask, to avoid hangs under memory pressure. But nowadays
    it uses splice, usually going through __generic_file_splice_read. That
    must use mapping_gfp_mask instead of GFP_KERNEL to avoid those hangs.

    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Mar, 2008

1 commit

  • sys_tee() is currently a bit eager in returning -EAGAIN: it may do so
    even if there is no chance of any more data becoming available. So
    improve the logic and only return -EAGAIN if we have an attached writer
    to the input pipe.

    Reported by Johann Felix Soden and Patrick McManus.

    Tested-by: Johann Felix Soden
    Signed-off-by: Jens Axboe

    Jens Axboe
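
The decision described above can be sketched as a pure function over the input pipe's state. This is an illustrative model, not the code in sys_tee(): `struct model_pipe_state`, the `MODEL_*` constants, and the return convention are all hypothetical.

```c
#include <errno.h>

/* Hypothetical snapshot of the input pipe. */
struct model_pipe_state {
    unsigned nrbufs;   /* buffers currently queued */
    unsigned writers;  /* writers still attached */
};

#define MODEL_F_NONBLOCK 0x1  /* hypothetical non-blocking flag */

enum {
    MODEL_EOF = 0,        /* nothing queued and nothing will arrive */
    MODEL_READY = 1,      /* data is available now */
    MODEL_MUST_WAIT = 2   /* a blocking caller would sleep here */
};

static int model_tee_input_state(const struct model_pipe_state *in,
                                 unsigned flags)
{
    if (in->nrbufs)
        return MODEL_READY;
    if (!in->writers)
        return MODEL_EOF;      /* no writer attached: do not report -EAGAIN */
    if (flags & MODEL_F_NONBLOCK)
        return -EAGAIN;        /* a writer exists, so retrying makes sense */
    return MODEL_MUST_WAIT;
}
```

The point of the ordering is that -EAGAIN is only reachable when a writer is attached, so a caller that sees it can expect a later retry to make progress.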
     

11 Feb, 2008

1 commit

  • Commit 8811930dc74a503415b35c4a79d14fb0b408a361 ("splice: missing user
    pointer access verification") added the proper access_ok() calls to
    copy_from_user_mmap_sem() which ensures we can copy the struct iovecs
    from userspace to the kernel.

    But we also must check whether we can access the actual memory region
    pointed to by the struct iovec to fix the access checks properly.

    Signed-off-by: Bastian Blank
    Acked-by: Oliver Pinter
    Cc: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Bastian Blank
     

09 Feb, 2008

1 commit

  • vmsplice_to_user() must always check the user pointer and length
    with access_ok() before copying. Likewise, for the slow path of
    copy_from_user_mmap_sem() we need to check that we may read from
    the user region.

    Signed-off-by: Jens Axboe
    Cc: Wojciech Purczynski
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

01 Feb, 2008

1 commit

  • Andre Majorel points out that if we only updated the atime when we
    transfer some data, we deviate from the standard of always updating
    the atime. So change splice to always call file_accessed() even if
    splice_direct_to_actor() didn't transfer any data.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

30 Jan, 2008

1 commit

  • A bug report on nfsd states that, since it was switched to use
    splice instead of sendfile, the atime is no longer being updated
    on the input file. do_generic_mapping_read() does this when accessing
    the file; make splice do it for the direct splice handler.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jan, 2008

1 commit


25 Jan, 2008

1 commit