01 Apr, 2014

1 commit

  • Pull s390 compat wrapper rework from Heiko Carstens:
    "S390 compat system call wrapper simplification work.

    The intention of this work is to get rid of all hand written assembly
    compat system call wrappers on s390, which perform proper sign or zero
    extension, or pointer conversion of compat system call parameters.
    Instead all of this should be done with C code eg by using Al's
    COMPAT_SYSCALL_DEFINEx() macro.

    Therefore all common code and s390 specific compat system calls have
    been converted to the COMPAT_SYSCALL_DEFINEx() macro.

    In order to generate correct code all compat system calls may only
    have eg compat_ulong_t parameters, but no unsigned long parameters.
    Those patches which change parameter types from unsigned long to
    compat_ulong_t parameters are separate in this series, but shouldn't
    cause any harm.

    The only compat system calls which intentionally have 64 bit
    parameters (preadv64 and pwritev64) in support of the x86/32 ABI
    haven't been changed, but are now only available if an architecture
    defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

    System calls which do not have a compat variant but still need proper
    zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
    get a proper wrapper function with the new s390 specific
    COMPAT_SYSCALL_WRAPx() macro:

    COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

    which generates the following code (simplified):

    asmlinkage long sys_brk(unsigned long brk);
    asmlinkage long compat_sys_brk(long brk)
    {
    return sys_brk((u32)brk);
    }

    Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
    includes both linux/syscall.h and linux/compat.h, it will generate
    build errors, if the declaration of sys_brk() doesn't match, or if
    there exists a non-matching compat_sys_brk() declaration.

    In addition this will intentionally result in a link error if
    somewhere else a compat_sys_brk() function exists, which probably
    should have been used instead. Two more BUILD_BUG_ONs make sure the
    size and type of each compat syscall parameter can be handled
    correctly with the s390 specific macros.

    I converted the compat system calls step by step to verify the
    generated code is correct and matches the previous code. In fact it
    did not always match, however that was always a bug in the hand
    written asm code.

    In result we get less code, less bugs, and much more sanity checking"

    * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
    s390/compat: add copyright statement
    compat: include linux/unistd.h within linux/compat.h
    s390/compat: get rid of compat wrapper assembly code
    s390/compat: build error for large compat syscall args
    mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: convert to COMPAT_SYSCALL_DEFINE
    security/compat: convert to COMPAT_SYSCALL_DEFINE
    mm/compat: convert to COMPAT_SYSCALL_DEFINE
    net/compat: convert to COMPAT_SYSCALL_DEFINE
    kernel/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: optional preadv64/pwrite64 compat system calls
    ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
    s390/compat: partial parameter conversion within syscall wrappers
    s390/compat: automatic zero, sign and pointer conversion of syscalls
    s390/compat: add sync_file_range and fallocate compat syscalls
    ...

    Linus Torvalds
     

23 Mar, 2014

1 commit


10 Mar, 2014

2 commits

  • instead of returning the flags by reference, we can just have the
    low-level primitive return those in lower bits of unsigned long,
    with struct file * derived from the rest.

    Signed-off-by: Al Viro

    Al Viro
     
  • Our write() system call has always been atomic in the sense that you get
    the expected thread-safe contiguous write, but we haven't actually
    guaranteed that concurrent writes are serialized wrt f_pos accesses, so
    threads (or processes) that share a file descriptor and use "write()"
    concurrently would quite likely overwrite each others data.

    This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

    "2.9.7 Thread Interactions with Regular File Operations

    All of the following functions shall be atomic with respect to each
    other in the effects specified in POSIX.1-2008 when they operate on
    regular files or symbolic links: [...]"

    and one of the effects is the file position update.

    This unprotected file position behavior is not new behavior, and nobody
    has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
    Michael Kerrisk that was due to this.

    This resolves the issue with a f_pos-specific lock that is taken by
    read/write/lseek on file descriptors that may be shared across threads
    or processes.

    Reported-by: Yongzhi Pan
    Reported-by: Michael Kerrisk
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     

06 Mar, 2014

1 commit

  • The preadv64/pwrite64 have been implemented for the x32 ABI, in order
    to allow passing 64 bit arguments from user space without splitting
    them into two 32 bit parameters, like it would be necessary for usual
    compat tasks.
    Howevert these two system calls are only being used for the x32 ABI,
    so add __ARCH_WANT_COMPAT defines for these two compat syscalls and
    make these two only visible for x86.

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

30 Jan, 2014

1 commit

  • We got a report that the pwritev syscall does not work correctly in
    compat mode on s390.

    It turned out that with commit 72ec35163f9f ("switch compat readv/writev
    variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a
    couple of syscall parameters because the some parameter types haven't
    been converted from unsigned long to compat_ulong_t.

    This is needed for architectures where the ABI requires that the caller
    of a function performed zero and/or sign extension to 64 bit of all
    parameters.

    Signed-off-by: Heiko Carstens
    Cc: Al Viro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Hendrik Brueckner
    Cc: Martin Schwidefsky
    Cc: [v3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

22 Jan, 2014

1 commit

  • The compat_do_readv_writev() function was doing a verify_area on the
    incoming iov, but the nr_segs value is not checked. If someone passes
    in a -1 for nr_segs, for instance, the function should return an EINVAL.
    However, it returns a EFAULT because the verify_area fails because it is
    checking an array of size MAX_UINT. The check is bogus, anyway, because
    the next check, compat_rw_copy_check_uvector(), will do all the
    necessary checking, anyway. The non-compat do_readv_writev() function
    doesn't do this check, so I think it's safe to just remove the code.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     

25 Oct, 2013

1 commit


30 Jul, 2013

1 commit

  • This code doesn't serve any purpose anymore, since the aio retry
    infrastructure has been removed.

    This change should be safe because aio_read/write are also used for
    synchronous IO, and called from do_sync_read()/do_sync_write() - and
    there's no looping done in the sync case (the read and write syscalls).

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     

03 Jul, 2013

1 commit

  • For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
    SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
    matter in lseek_execute() to update the current file offset
    to the desired offset if it is valid, ceph also does the
    simliar things at ceph_llseek().

    To reduce the duplications, this patch make lseek_execute()
    public accessible so that we can call it directly from the
    underlying file systems.

    Thanks Dave Chinner for this suggestion.

    [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

    v2->v1:
    - Add kernel-doc comments for lseek_execute()
    - Call lseek_execute() in ceph->llseek()

    Signed-off-by: Jie Liu
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Ben Myers
    Cc: Ted Tso
    Cc: Hugh Dickins
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Sage Weil
    Signed-off-by: Al Viro

    Jie Liu
     

29 Jun, 2013

5 commits


20 Jun, 2013

1 commit


08 May, 2013

2 commits

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • This removes the retry-based AIO infrastructure now that nothing in tree
    is using it.

    We want to remove retry-based AIO because it is fundemantally unsafe.
    It retries IO submission from a kernel thread that has only assumed the
    mm of the submitting task. All other task_struct references in the IO
    submission path will see the kernel thread, not the submitting task.
    This design flaw means that nothing of any meaningful complexity can use
    retry-based AIO.

    This removes all the code and data associated with the retry machinery.
    The most significant benefit of this is the removal of the locking
    around the unused run list in the submission path.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Signed-off-by: Zach Brown
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     

05 May, 2013

1 commit


02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

30 Apr, 2013

1 commit


10 Apr, 2013

3 commits


28 Mar, 2013

1 commit

  • Commit 06ae43f34bcc ("Don't bother with redoing rw_verify_area() from
    default_file_splice_from()") lost the checks to test existence of the
    write/aio_write methods. My apologies ;-/

    Eventually, we want that in fs/splice.c side of things (no point
    repeating it for every buffer, after all), but for now this is the
    obvious minimal fix.

    Reported-by: Dave Jones
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

22 Mar, 2013

1 commit

  • default_file_splice_from() ends up calling vfs_write() (via very convoluted
    callchain). It's an overkill, since we already have done rw_verify_area()
    in the caller by the time we call vfs_write() we are under set_fs(KERNEL_DS),
    so access_ok() is also pointless. Add a new helper (__kernel_write()),
    use it instead of kernel_write() in there.

    Signed-off-by: Al Viro

    Al Viro
     

04 Mar, 2013

2 commits


03 Mar, 2013

1 commit

  • Pull signal/compat fixes from Al Viro:
    "Fixes for several regressions introduced in the last signal.git pile,
    along with fixing bugs in truncate and ftruncate compat (on just about
    anything biarch at least one of those two had been done wrong)."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    compat: restore timerfd settime and gettime compat syscalls
    [regression] braino in "sparc: convert to ksignal"
    fix compat truncate/ftruncate
    switch lseek to COMPAT_SYSCALL_DEFINE
    lseek() and truncate() on sparc really need sign extension

    Linus Torvalds
     

24 Feb, 2013

1 commit


23 Feb, 2013

1 commit


21 Dec, 2012

1 commit

  • do_sendfile() in fs/read_write.c does not call the fsnotify functions,
    unlike its neighbors. This manifests as a lack of inotify ACCESS events
    when a file is sent using sendfile(2).

    Addresses
    https://bugzilla.kernel.org/show_bug.cgi?id=12812

    [akpm@linux-foundation.org: use fsnotify_modify(out.file), not fsnotify_access(), per Dave]
    Signed-off-by: Alan Cox
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Scott Wolchok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scott Wolchok
     

18 Dec, 2012

1 commit


03 Oct, 2012

1 commit

  • This function is used by sparc, powerpc and arm64 for compat support.
    The patch adds a generic implementation which calls do_sendfile()
    directly and avoids set_fs().

    The sparc architecture has wrappers for the sign extensions while
    powerpc relies on the compiler to do the this. The patch adds wrappers
    for powerpc to handle the u32->int type conversion.

    compat_sys_sendfile64() can be replaced by a sys_sendfile() call since
    compat_loff_t has the same size as off_t on a 64-bit system.

    On powerpc, the patch also changes the 64-bit sendfile call from
    sys_sendile64 to sys_sendfile.

    Signed-off-by: Catalin Marinas
    Acked-by: David S. Miller
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Alexander Viro
    Cc: Andrew Morton
    Signed-off-by: Al Viro

    Catalin Marinas
     

27 Sep, 2012

1 commit


23 Jul, 2012

1 commit

  • For ext3/4 htree directories, using the vfs llseek function with
    SEEK_END goes to i_size like for any other file, but in reality
    we want the maximum possible hash value. Recent changes
    in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
    but replicating this core code seems like a bad idea, especially
    since the copy has already diverged from the vfs.

    This patch updates generic_file_llseek_size to accept
    both a custom maximum offset, and a custom EOF position. With this
    in place, ext4_dir_llseek can pass in the appropriate maximum hash
    position for both maxsize and eof, and get what it wants.

    As far as I know, this does not fix any bugs - nfs in the kernel
    doesn't use SEEK_END, and I don't know of any user who does. But
    some ext4 folks seem keen on doing the right thing here, and I can't
    really argue.

    (Patch also fixes up some comments slightly)

    Signed-off-by: Eric Sandeen
    Signed-off-by: Al Viro

    Eric Sandeen
     

01 Jun, 2012

1 commit

  • A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
    changes made to support CMA in an earlier patch.

    Rather than having an additional check_access parameter to these
    functions, the first paramater type is overloaded to allow the caller to
    specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
    are valid, but do not check the memory that they point to. This is used
    by process_vm_readv/writev where we need to validate that a iovec passed
    to the syscall is valid but do not want to check the memory that it points
    to at this point because it refers to an address space in another process.

    Signed-off-by: Chris Yeoh
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

29 Feb, 2012

1 commit


01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    - Use of /proc/pid/mem has been considered, but there are issues with
    using it:
    - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
    written to would need to be contiguous.
    - Currently mem_read allows only processes who are currently
    ptrace'ing the target and are still able to ptrace the target to read
    from the target. This check could possibly be moved to the open call,
    but its not clear exactly what race this restriction is stopping
    (reason appears to have been lost)
    - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other
    - Doesn't allow for some future use of the interface we would like to
    consider adding in the future (see below)
    - Interestingly reading from /proc/pid/mem currently actually
    involves two copies! (But this could be fixed pretty easily)

    As mentioned previously use of vmsplice instead was considered, but has
    problems. Since you need the reader and writer working co-operatively if
    the pipe is not drained then you block. Which requires some wrapping to
    do non blocking on the send side or polling on the receive. In all to all
    communication it requires ordering otherwise you can deadlock. And in the
    example of many MPI tasks writing to one MPI task vmsplice serialises the
    copying.

    There are some cases of MPI collectives where even a single copy interface
    does not get us the performance gain we could. For example in an
    MPI_Reduce rather than copy the data from the source we would like to
    instead use it directly in a mathops (say the reduce is doing a sum) as
    this would save us doing a copy. We don't need to keep a copy of the data
    from the source. I haven't implemented this, but I think this interface
    could in the future do all this through the use of the flags - eg could
    specify the math operation and type and the kernel rather than just
    copying the data would apply the specified operation between the source
    and destination and store it in the destination.

    Although we don't have a "second user" of the interface (though I've had
    some nibbles from people who may be interested in using it for intra
    process messaging which is not MPI). This interface is something which
    hardware vendors are already doing for their custom drivers to implement
    fast local communication. And so in addition to this being useful for
    OpenMPI it would mean the driver maintainers don't have to fix things up
    when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc, other architecture should
    mainly (I think) just need to add syscall numbers for the process_vm_readv
    and process_vm_writev. There are 32 bit compatibility versions for
    64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh