30 Apr, 2013

1 commit


28 Mar, 2013

1 commit

  • Commit 06ae43f34bcc ("Don't bother with redoing rw_verify_area() from
    default_file_splice_from()") lost the checks to test existence of the
    write/aio_write methods. My apologies ;-/

    Eventually, we want that in fs/splice.c side of things (no point
    repeating it for every buffer, after all), but for now this is the
    obvious minimal fix.

    Reported-by: Dave Jones
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

22 Mar, 2013

1 commit

  • default_file_splice_from() ends up calling vfs_write() (via a very convoluted
    callchain). It's overkill, since we already have done rw_verify_area()
    in the caller, and by the time we call vfs_write() we are under
    set_fs(KERNEL_DS), so access_ok() is also pointless. Add a new helper
    (__kernel_write()), use it instead of kernel_write() in there.

    Signed-off-by: Al Viro

    Al Viro
     

03 Mar, 2013

1 commit

  • Pull signal/compat fixes from Al Viro:
    "Fixes for several regressions introduced in the last signal.git pile,
    along with fixing bugs in truncate and ftruncate compat (on just about
    anything biarch at least one of those two had been done wrong)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    compat: restore timerfd settime and gettime compat syscalls
    [regression] braino in "sparc: convert to ksignal"
    fix compat truncate/ftruncate
    switch lseek to COMPAT_SYSCALL_DEFINE
    lseek() and truncate() on sparc really need sign extension

    Linus Torvalds
     

24 Feb, 2013

1 commit


23 Feb, 2013

1 commit


21 Dec, 2012

1 commit

  • do_sendfile() in fs/read_write.c does not call the fsnotify functions,
    unlike its neighbors. This manifests as a lack of inotify ACCESS events
    when a file is sent using sendfile(2).

    Addresses
    https://bugzilla.kernel.org/show_bug.cgi?id=12812

    [akpm@linux-foundation.org: use fsnotify_modify(out.file), not fsnotify_access(), per Dave]
    Signed-off-by: Alan Cox
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Scott Wolchok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scott Wolchok
     

18 Dec, 2012

1 commit


03 Oct, 2012

1 commit

  • This function is used by sparc, powerpc and arm64 for compat support.
    The patch adds a generic implementation which calls do_sendfile()
    directly and avoids set_fs().

    The sparc architecture has wrappers for the sign extensions while
    powerpc relies on the compiler to do this. The patch adds wrappers
    for powerpc to handle the u32->int type conversion.

    compat_sys_sendfile64() can be replaced by a sys_sendfile() call since
    compat_loff_t has the same size as off_t on a 64-bit system.

    On powerpc, the patch also changes the 64-bit sendfile call from
    sys_sendfile64 to sys_sendfile.

    Signed-off-by: Catalin Marinas
    Acked-by: David S. Miller
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Alexander Viro
    Cc: Andrew Morton
    Signed-off-by: Al Viro

    Catalin Marinas
     

27 Sep, 2012

1 commit


23 Jul, 2012

1 commit

  • For ext3/4 htree directories, using the vfs llseek function with
    SEEK_END goes to i_size like for any other file, but in reality
    we want the maximum possible hash value. Recent changes
    in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
    but replicating this core code seems like a bad idea, especially
    since the copy has already diverged from the vfs.

    This patch updates generic_file_llseek_size to accept
    both a custom maximum offset, and a custom EOF position. With this
    in place, ext4_dir_llseek can pass in the appropriate maximum hash
    position for both maxsize and eof, and get what it wants.

    As far as I know, this does not fix any bugs - nfs in the kernel
    doesn't use SEEK_END, and I don't know of any user who does. But
    some ext4 folks seem keen on doing the right thing here, and I can't
    really argue.

    (Patch also fixes up some comments slightly)
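
The effect of passing a separate maximum offset and EOF position can be
sketched in user space. The following is a simplified model, not the kernel
function: the name llseek_size_model and its flat argument list are
hypothetical, and locking and f_pos updates are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>  /* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * User-space model (hypothetical, not kernel code) of an llseek helper that
 * takes both a maximum offset and an EOF position. Returns the new position
 * or a negative errno value.
 */
static int64_t llseek_size_model(int64_t cur_pos, int64_t offset, int whence,
                                 int64_t maxsize, int64_t eof)
{
    int64_t base;

    assert(cur_pos >= 0 && maxsize >= 0 && eof >= 0);

    switch (whence) {
    case SEEK_SET: base = 0;       break;
    case SEEK_CUR: base = cur_pos; break;
    case SEEK_END: base = eof;     break;  /* caller-supplied EOF, not i_size */
    default:       return -EINVAL;
    }

    if (offset > INT64_MAX - base)          /* would overflow */
        return -EINVAL;
    offset += base;

    if (offset < 0 || offset > maxsize)     /* caller-supplied limit */
        return -EINVAL;
    return offset;
}
```

A directory implementation in the style described above would pass its
maximum hash position as both maxsize and eof, so that SEEK_END with offset 0
lands on that position rather than on the file size.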

    Signed-off-by: Eric Sandeen
    Signed-off-by: Al Viro

    Eric Sandeen
     

01 Jun, 2012

1 commit

  • A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
    changes made to support CMA in an earlier patch.

    Rather than having an additional check_access parameter to these
    functions, the first parameter type is overloaded to allow the caller to
    specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
    are valid, but do not check the memory that they point to. This is used
    by process_vm_readv/writev where we need to validate that an iovec passed
    to the syscall is valid but do not want to check the memory that it points
    to at this point because it refers to an address space in another process.
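
The idea of overloading the first parameter can be illustrated with a small
user-space model. This is a sketch under assumptions, not the kernel
routine: check_iovec_model, range_ok_fn and range_never_ok are hypothetical
names, the enum values are illustrative, and rejecting an overflowing total
length is a simplification chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Illustrative stand-ins for the access type passed as first parameter. */
enum { CHECK_IOVEC_ONLY = -1, CHECK_READ = 0, CHECK_WRITE = 1 };

/* Hypothetical memory check: returns nonzero if [base, base+len) is usable. */
typedef int (*range_ok_fn)(const void *base, size_t len);

/* Stand-in for a failing access check, handy for showing the difference. */
static int range_never_ok(const void *base, size_t len)
{
    (void)base;
    (void)len;
    return 0;
}

/*
 * Validate the iovec array itself; only consult range_ok() when the caller
 * did not ask for CHECK_IOVEC_ONLY. Returns the total length or -errno.
 */
static ssize_t check_iovec_model(int type, const struct iovec *iov, int nr,
                                 range_ok_fn range_ok)
{
    size_t total = 0;

    assert(iov != NULL || nr == 0);
    if (nr < 0)
        return -EINVAL;

    for (int i = 0; i < nr; i++) {
        size_t len = iov[i].iov_len;

        if (len > (size_t)SSIZE_MAX - total)    /* total must fit ssize_t */
            return -EINVAL;
        if (type != CHECK_IOVEC_ONLY && !range_ok(iov[i].iov_base, len))
            return -EFAULT;
        total += len;
    }
    return (ssize_t)total;
}
```

With CHECK_IOVEC_ONLY the memory check is never called, which is the
behaviour wanted when the addresses belong to another process.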

    Signed-off-by: Chris Yeoh
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

29 Feb, 2012

1 commit


01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.
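
From user space, the read direction looks roughly like the sketch below. It
assumes the process_vm_readv() wrapper as later shipped by glibc in
<sys/uio.h>; the helper name copy_from_process is hypothetical, and a real
caller would obtain the remote address and size from the source process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from address src in process pid into dst in the caller.
 * Returns the number of bytes copied, or -1 with errno set (for example
 * EPERM when the caller is not allowed to access the target).
 */
static ssize_t copy_from_process(pid_t pid, void *dst, const void *src,
                                 size_t len)
{
    struct iovec local  = { .iov_base = dst,         .iov_len = len };
    struct iovec remote = { .iov_base = (void *)src, .iov_len = len };

    assert(dst != NULL && src != NULL);

    /* One local and one remote segment; the flags argument must be 0. */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

Calling it with getpid() as the target copies within the caller's own
address space, which is a convenient way to check that the syscall is
available.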

    - Use of /proc/pid/mem has been considered, but there are issues with
      using it:
      - Does not allow for specifying iovecs for both src and dest; assuming
        preadv or pwritev was implemented, either the area read from or
        written to would need to be contiguous.
      - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but it's not clear exactly what race this restriction is stopping
        (reason appears to have been lost).
      - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes that all need to do this with each other.
      - Doesn't allow for some future use of the interface we would like to
        consider adding (see below).
      - Interestingly, reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily.)

    As mentioned previously, use of vmsplice instead was considered, but has
    problems. Since you need the reader and writer working co-operatively, if
    the pipe is not drained then you block, which requires some wrapping to
    do non-blocking on the send side or polling on the receive. In all-to-all
    communication it requires ordering, otherwise you can deadlock. And in the
    example of many MPI tasks writing to one MPI task, vmsplice serialises the
    copying.

    There are some cases of MPI collectives where even a single copy interface
    does not get us the performance gain we could. For example in an
    MPI_Reduce rather than copy the data from the source we would like to
    instead use it directly in a mathops (say the reduce is doing a sum) as
    this would save us doing a copy. We don't need to keep a copy of the data
    from the source. I haven't implemented this, but I think this interface
    could in the future do all this through the use of the flags - eg could
    specify the math operation and type and the kernel rather than just
    copying the data would apply the specified operation between the source
    and destination and store it in the destination.

    Although we don't have a "second user" of the interface (though I've had
    some nibbles from people who may be interested in using it for intra
    process messaging which is not MPI), this interface is something which
    hardware vendors are already doing for their custom drivers to implement
    fast local communication. And so in addition to this being useful for
    OpenMPI it would mean the driver maintainers don't have to fix things up
    when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc; other architectures should
    mainly (I think) just need to add syscall numbers for process_vm_readv
    and process_vm_writev. There are 32 bit compatibility versions for
    64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

28 Oct, 2011

2 commits

  • Add a generic_file_llseek variant to the VFS that allows passing in
    the maximum file size of the file system, instead of always
    using maxbytes from the superblock.

    This can be used to eliminate some cut'n'paste seek code in ext4.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • The i_mutex lock use of generic_file_llseek hurts. Independent processes
    accessing the same file synchronize over a single lock, even though
    they have no need for synchronization at all.

    Under high utilization this can cause llseek to scale very poorly on larger
    systems.

    This patch does some rethinking of the llseek locking model:

    First the 64bit f_pos is not necessarily atomic without locks
    on 32bit systems. This can already cause races with read() today.
    This was discussed on linux-kernel in the past and deemed acceptable.
    The patch does not change that.

    Let's look at the different seek variants:

    SEEK_SET: Doesn't really need any locking.
    If there's a race one writer wins, the other loses.

    For 32bit the non atomic update races against read()
    stay the same. Without a lock they can also happen
    against write() now. The read() race was deemed
    acceptable in past discussions, and I think if it's
    ok for read it's ok for write too.

    => Don't need a lock.

    SEEK_END: This behaves like SEEK_SET plus it reads
    the maximum size too. Reading the maximum size would have the
    32bit atomic problem. But luckily we already have a way to read
    the maximum size without locking (i_size_read), so we
    can just use that instead.

    Without i_mutex there is no synchronization with write() anymore,
    however since the write() update is atomic on 64bit it just behaves
    like another racy SEEK_SET. On non atomic 32bit it's the same
    as SEEK_SET.

    => Don't need a lock, but need to use i_size_read()

    SEEK_CUR: This has a read-modify-write race window
    on the same file. One could argue that any application
    doing unsynchronized seeks on the same file is already broken.
    But for the sake of not adding a regression here I'm
    using the file->f_lock to synchronize this. Using this
    lock is much better than the inode mutex because it doesn't
    synchronize between processes.

    => So still need a lock, but can use a f_lock.

    This patch implements this new scheme in generic_file_llseek.
    I dropped generic_file_llseek_unlocked and changed all callers.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     

27 Jul, 2011

1 commit


21 Jul, 2011

1 commit

  • This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
    using fiemap in things like cp causes more problems than it solves, so let's try
    and give userspace an interface that doesn't suck. We need to match Solaris
    here, and the definitions are

    *o* If /whence/ is SEEK_HOLE, the offset of the start of the
    next hole greater than or equal to the supplied offset
    is returned. The definition of a hole is provided near
    the end of the DESCRIPTION.

    *o* If /whence/ is SEEK_DATA, the file pointer is set to the
    start of the next non-hole file region greater than or
    equal to the supplied offset.

    So in the generic case the entire file is data and there is a virtual hole at
    the end. That means we will just return i_size for SEEK_HOLE and will return
    the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
    the same way.
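
The generic case described above fits in a few lines. The sketch below is a
user-space model with hypothetical names (seek_hole_data_model,
MODEL_SEEK_DATA, MODEL_SEEK_HOLE). Returning -ENXIO for an offset at or
beyond the end of the file is taken from the documented Linux lseek(2)
behaviour, not from the text above, and the model only covers non-negative
offsets.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local names so that no feature-test macro is needed for the example. */
enum { MODEL_SEEK_DATA = 3, MODEL_SEEK_HOLE = 4 };

/*
 * Generic case: the whole file is data, followed by one virtual hole at
 * i_size. Returns the resulting offset or a negative errno value.
 */
static int64_t seek_hole_data_model(int64_t offset, int whence, int64_t i_size)
{
    assert(offset >= 0 && i_size >= 0);

    if (whence != MODEL_SEEK_DATA && whence != MODEL_SEEK_HOLE)
        return -EINVAL;
    if (offset >= i_size)           /* nothing at or past end of file */
        return -ENXIO;

    /* SEEK_DATA: already in data. SEEK_HOLE: the only hole is at EOF. */
    return whence == MODEL_SEEK_HOLE ? i_size : offset;
}
```

A filesystem that tracks real holes would replace the last line with a
lookup in its own extent information.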

    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

13 Jan, 2011

1 commit


18 Nov, 2010

1 commit


30 Oct, 2010

1 commit


26 Oct, 2010

1 commit

  • Now, rw_verify_area() checks whether f_pos is negative and, if it is,
    returns -EINVAL.

    But, some special files as /dev/(k)mem and /proc//mem etc.. has
    negative offsets. And we can't do any access via read/write to the
    file(device).

    So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.

    Signed-off-by: Wu Fengguang
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Al Viro
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    KAMEZAWA Hiroyuki
     

15 Oct, 2010

2 commits

  • All file operations now have an explicit .llseek
    operation pointer, so we can change the default
    action for future code.

    This changes the default from default_llseek
    to no_llseek, which always returns -ESPIPE if
    a user tries to seek on a file without a .llseek
    operation.
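
From user space, a non-seekable file shows up as lseek() failing with
ESPIPE. A minimal sketch follows; the helper name seek_errno is
hypothetical, and a pipe is suggested for trying it only because it is a
readily available descriptor that cannot seek.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try a no-op seek on fd. Returns 0 if the descriptor is seekable,
 * otherwise the errno reported by lseek() (ESPIPE for pipes and sockets).
 */
static int seek_errno(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_CUR) == (off_t)-1 ? errno : 0;
}
```

Applications that want to support both regular files and streams can use
such a probe to decide whether to seek or to read sequentially.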

    The name of the default_llseek function remains
    unchanged, if anyone thinks we should change it,
    please speak up.

    Signed-off-by: Arnd Bergmann
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org

    Arnd Bergmann
     
  • There are currently 191 users of default_llseek.
    Nine of these are in device drivers that use the
    big kernel lock. None of these ever touch
    file->f_pos outside of llseek or file_pos_write.

    Consequently, we never rely on the BKL
    in the default_llseek function and can
    replace that with i_mutex, which is also
    used in generic_file_llseek.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

28 Jul, 2010

1 commit

  • fanotify, the upcoming notification system actually needs a struct path so it can
    do opens in the context of listeners, and it needs a file so it can get f_flags
    from the original process. Close was the only operation that already was passing
    a struct file to the notification hook. This patch passes a file for access,
    modify, and open as well as they are easily available to these hooks.

    Signed-off-by: Eric Paris

    Eric Paris
     

28 May, 2010

1 commit

  • This is an implementation of ->llseek useable for the rare special case
    when userspace expects the seek to succeed but the (device) file is
    actually not able to perform the seek. In this case you use noop_llseek()
    instead of falling back to the default implementation of ->llseek.

    Signed-off-by: Jan Blunck
    Cc: Frederic Weisbecker
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

25 Mar, 2010

1 commit


04 Nov, 2009

1 commit

  • sendfile(2) was reworked with the splice infrastructure, but it still
    wrongly checks f_op.sendpage() instead of f_op.splice_write(). Although
    f_op.splice_write() currently always exists whenever f_op.sendpage()
    exists, that assumption could silently break in the future. This
    patch also brings a side effect: sendfile(2) can work with any output
    file. Some security checks related to f_op are added too.
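
The side effect can be seen from user space by using a regular file as the
output descriptor. The sketch below assumes a Linux kernel that includes
this change; copy_with_sendfile and sendfile_demo are hypothetical helper
names, and temporary files stand in for real input and output.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len bytes from the start of in_fd to out_fd with sendfile(2).
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_with_sendfile(int out_fd, int in_fd, size_t len)
{
    off_t off = 0;      /* explicit input offset; in_fd's position is unused */
    size_t done = 0;

    assert(out_fd >= 0 && in_fd >= 0);

    while (done < len) {
        ssize_t n = sendfile(out_fd, in_fd, &off, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)     /* end of input */
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Round trip through two temporary regular files; returns 0 on success. */
static int sendfile_demo(void)
{
    static const char msg[] = "sendfile to a regular file";
    char buf[sizeof msg] = {0};
    FILE *in = tmpfile(), *out = tmpfile();
    int ok = -1;

    if (in && out &&
        fwrite(msg, 1, sizeof msg, in) == sizeof msg && fflush(in) == 0 &&
        copy_with_sendfile(fileno(out), fileno(in), sizeof msg) ==
            (ssize_t)sizeof msg &&
        pread(fileno(out), buf, sizeof buf, 0) == (ssize_t)sizeof buf &&
        memcmp(buf, msg, sizeof msg) == 0)
        ok = 0;

    if (in)
        fclose(in);
    if (out)
        fclose(out);
    return ok;
}
```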

    Signed-off-by: Changli Gao
    Signed-off-by: Jens Axboe

    Changli Gao
     

24 Sep, 2009

1 commit

  • As Johannes Weiner pointed out, one of the range checks in do_sendfile
    is redundant and is already checked in rw_verify_area.

    Signed-off-by: Jeff Layton
    Reviewed-by: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Robert Love
    Cc: Mandeep Singh Baines
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Jeff Layton
     

11 May, 2009

1 commit

  • If f_op->splice_read() is not implemented, fall back to a plain read.
    Use vfs_readv() to read into previously allocated pages.

    This will allow splice and functions using splice, such as the loop
    device, to work on all filesystems. This includes "direct_io" files
    in fuse which bypass the page cache.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     

05 Apr, 2009

1 commit

  • Instead of always splitting the file offset into 32-bit 'high' and 'low'
    parts, just split them into the largest natural word-size - which in C
    terms is 'unsigned long'.

    This allows 64-bit architectures to avoid the unnecessary 32-bit
    shifting and masking for native format (while the compat interfaces will
    obviously always have to do it).

    This also changes the order of 'high' and 'low' to be "low first". Why?
    Because when we have it like this, the 64-bit system calls now don't use
    the "pos_high" argument at all, and it makes more sense for the native
    system call to simply match the user-mode prototype.

    This results in a much more natural calling convention, and allows the
    compiler to generate much more straightforward code. On x86-64, we now
    generate

        testq   %rcx, %rcx              # pos_l
        js      .L122                   #,
        movq    %rcx, -48(%rbp)         # pos_l, pos

    from the C source

        loff_t pos = pos_from_hilo(pos_h, pos_l);
        ...
        if (pos < 0)
            return -EINVAL;

    and the 'pos_h' register isn't even touched. It used to generate code
    like

        mov     %r8d, %r8d              # pos_low, pos_low
        salq    $32, %rcx               #, tmp71
        movq    %r8, %rax               # pos_low, pos.386
        orq     %rcx, %rax              # tmp71, pos.386
        js      .L122                   #,
        movq    %rax, -48(%rbp)         # pos.386, pos

    which isn't _that_ horrible, but it does show how the natural word size
    is just a more sensible interface (same arguments will hold in the user
    level glibc wrapper function, of course, so the kernel side is just half
    of the equation!)

    Note: in all cases the user code wrapper can again be the same. You can
    just do

        #define HALF_BITS (sizeof(unsigned long)*4)
        __syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);

    or something like that. That way the user mode wrapper will also be
    nicely passing in a zero (it won't actually have to do the shifts, the
    compiler will understand what is going on) for the last argument.

    And that is a good idea, even if nobody will necessarily ever care: if
    we ever do move to a 128-bit lloff_t, this particular system call might
    be left alone. Of course, that will be the least of our worries if we
    really ever need to care, so this may not be worth really caring about.
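
The split and the recombination can be tried out in user space. This is a
sketch: pos_from_hilo follows the name used in the C source quoted above,
while pos_to_hilo and HALF_LONG_BITS are hypothetical names for the
wrapper side, and int64_t stands in for loff_t.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Half the width of the natural word, so two shifts make a full word. */
#define HALF_LONG_BITS (sizeof(unsigned long) * CHAR_BIT / 2)

/*
 * Recombine the two halves. Shifting twice by half the word width avoids an
 * undefined full-width shift when unsigned long is already 64 bits; in that
 * case the high part simply drops out.
 */
static int64_t pos_from_hilo(unsigned long high, unsigned long low)
{
    return (int64_t)((((uint64_t)high << HALF_LONG_BITS) << HALF_LONG_BITS)
                     | low);
}

/* What a user-mode wrapper would pass: low word first, then the rest. */
static void pos_to_hilo(int64_t pos, unsigned long *high, unsigned long *low)
{
    assert(pos >= 0 && high && low);
    *low  = (unsigned long)pos;
    *high = (unsigned long)(((uint64_t)pos >> HALF_LONG_BITS)
                            >> HALF_LONG_BITS);
}
```

On a 64-bit build the high word comes out as zero for any offset, matching
the remark above that the last argument is effectively unused there.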

    [ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]

    Acked-by: Gerd Hoffmann
    Cc: H. Peter Anvin
    Cc: Andrew Morton
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: Ingo Molnar
    Cc: Ralf Baechle
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Apr, 2009

1 commit

  • This patch adds preadv and pwritev system calls. These syscalls are a
    pretty straightforward combination of pread and readv (same for write).
    They are quite useful for doing vectored I/O in threaded applications.
    Using lseek+readv instead opens race windows you'll have to plug with
    locking.

    Other systems have such system calls too, for example NetBSD, check
    here: http://www.daemon-systems.org/man/preadv.2.html

    The application-visible interface provided by glibc should look like
    this to be compatible with the existing implementations in the *BSD family:

    ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
    ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);

    This prototype has one problem though: on 32bit archs the (64bit)
    offset argument is unaligned, which the syscall ABI of several archs
    doesn't allow. At least s390 needs a wrapper in glibc to handle this. As
    we'll need wrappers in glibc anyway I've decided to push the problem to
    glibc entirely and use a syscall prototype which works without
    arch-specific wrappers inside the kernel: the offset argument is
    explicitly split into two 32bit values.

    The patch sports the actual system call implementation and the windup in
    the x86 system call tables. Other archs follow as separate patches.
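
For a feel of the application-visible side, here is a small sketch that uses
the glibc preadv() wrapper with the BSD-style prototype shown above. The
helper names read_header_and_body and preadv_demo are hypothetical, and the
file contents are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Read a header and a body into separate buffers with one call, starting at
 * an explicit file offset. The descriptor's own offset is not used, so no
 * lseek() and no locking around it are needed in threaded code.
 */
static ssize_t read_header_and_body(int fd, off_t offset,
                                    void *hdr, size_t hdr_len,
                                    void *body, size_t body_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };

    assert(fd >= 0);
    return preadv(fd, iov, 2, offset);
}

/* Exercise the helper on a temporary file; returns 0 on success. */
static int preadv_demo(void)
{
    char hdr[4] = {0}, body[6] = {0};
    FILE *f = tmpfile();
    int ok = -1;

    if (f && fputs("xxHEADpayload", f) >= 0 && fflush(f) == 0 &&
        read_header_and_body(fileno(f), 2, hdr, sizeof hdr,
                             body, sizeof body) == 10 &&
        memcmp(hdr, "HEAD", 4) == 0 && memcmp(body, "payloa", 6) == 0)
        ok = 0;

    if (f)
        fclose(f);
    return ok;
}
```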

    Signed-off-by: Gerd Hoffmann
    Cc: Arnd Bergmann
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerd Hoffmann
     

14 Jan, 2009

5 commits


06 Jan, 2009

1 commit

  • This patch fixes a race condition in lseek. While it is expected that
    unpredictable behaviour may result while repositioning the offset of a
    file descriptor concurrently with reading/writing to the same file
    descriptor, this should not happen when merely *reading* the file
    descriptor's offset.

    Unfortunately, the only portable way in Unix to read a file
    descriptor's offset is lseek(fd, 0, SEEK_CUR); however executing this
    concurrently with read/write may mess up the position.
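
The idiom in question is short. A sketch, with current_offset and
offset_demo as hypothetical helper names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Query the current offset of fd without moving it: a zero-length relative
 * seek. Returns (off_t)-1 with errno set on failure.
 */
static off_t current_offset(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_CUR);
}

/* Write three bytes to a temporary file and read the offset back. */
static int offset_demo(void)
{
    FILE *f = tmpfile();
    int ok = -1;

    if (f && write(fileno(f), "abc", 3) == 3 &&
        current_offset(fileno(f)) == 3)
        ok = 0;

    if (f)
        fclose(f);
    return ok;
}
```

Since the call only reads the position, a caller would expect it to be safe
alongside read() or write() on the same descriptor, which is the situation
this patch addresses.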

    [with fixes from akpm]

    Signed-off-by: Alain Knaff
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Alain Knaff
     

23 Oct, 2008

1 commit


03 Jul, 2008

1 commit

  • - Replace remote_llseek with generic_file_llseek_unlocked (to force compilation
    failures in all users)
    - Change all users to either use generic_file_llseek_unlocked directly or
    take the BKL around. I changed the file systems that don't use the BKL
    for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS
    take the BKL, but explicitly in their own source now.

    I moved them all over in a single patch to avoid unbisectable sections.

    Open problem: 32bit kernels can corrupt fpos because its modification
    is not atomic, but they can do that anyway because there are other paths
    that modify it without the BKL.

    Do we need a special lock for the pos/f_version = 0 checks?

    Trond says the NFS BKL is likely not needed, but keep it for now
    until his full audit.

    v2: Use generic_file_llseek_unlocked instead of remote_llseek_unlocked
    and factor duplicated code (suggested by hch)

    Cc: Trond.Myklebust@netapp.com
    Cc: swhiteho@redhat.com
    Cc: sfrench@samba.org
    Cc: vandrove@vc.cvut.cz

    Signed-off-by: Andi Kleen
    Signed-off-by: Andi Kleen
    Signed-off-by: Jonathan Corbet

    Andi Kleen