11 Dec, 2014

1 commit

  • Pull VFS changes from Al Viro:
    "First pile out of several (there _definitely_ will be more). Stuff in
    this one:

    - unification of d_splice_alias()/d_materialize_unique()

    - iov_iter rewrite

    - killing a bunch of ->f_path.dentry users (and f_dentry macro).

    Getting that completed will make life much simpler for
    unionmount/overlayfs, since then we'll be able to limit the places
    sensitive to file _dentry_ to reasonably few. Which allows to have
    file_inode(file) pointing to inode in a covered layer, with dentry
    pointing to (negative) dentry in union one.

    Still not complete, but much closer now.

    - crapectomy in lustre (dead code removal, mostly)

    - "let's make seq_printf return nothing" preparations

    - assorted cleanups and fixes

    There _definitely_ will be more piles"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    copy_from_iter_nocache()
    new helper: iov_iter_kvec()
    csum_and_copy_..._iter()
    iov_iter.c: handle ITER_KVEC directly
    iov_iter.c: convert copy_to_iter() to iterate_and_advance
    iov_iter.c: convert copy_from_iter() to iterate_and_advance
    iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
    iov_iter.c: convert iov_iter_zero() to iterate_and_advance
    iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
    iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
    iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
    iov_iter.c: iterate_and_advance
    iov_iter.c: macros for iterating over iov_iter
    kill f_dentry macro
    dcache: fix kmemcheck warning in switch_names
    new helper: audit_file()
    nfsd_vfs_write(): use file_inode()
    ncpfs: use file_inode()
    kill f_dentry uses
    lockd: get rid of ->f_path.dentry->d_sb
    ...

    Linus Torvalds
     

04 Dec, 2014

1 commit

  • ipc_addid() makes a new ipc identifier visible to everyone. New objects
    start as locked, so that the caller can complete the initialization
    after the call. Within struct sem_array, at least sma->sem_base and
    sma->sem_nsems are accessed without any locks, therefore this approach
    doesn't work.

    Thus: Move the ipc_addid() to the end of the initialization.
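
    The ordering fix can be illustrated with a minimal user-space sketch.
    The names below (sem_set, id_table, publish, create_sem_set) are
    hypothetical stand-ins, not the kernel's ipc_addid() or struct sem_array,
    and the sketch is single-threaded: real code would also need the locking
    and memory ordering that the kernel provides.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDS 8

/* Hypothetical stand-in for struct sem_array. */
struct sem_set {
    int nsems;  /* read by lookups without taking a lock */
    int *base;  /* read by lookups without taking a lock */
};

/* Hypothetical id table: a non-NULL slot is visible to every reader. */
static struct sem_set *id_table[MAX_IDS];

/* Publish the object; from here on, any reader may find it. */
static int publish(struct sem_set *s)
{
    for (int id = 0; id < MAX_IDS; id++) {
        if (id_table[id] == NULL) {
            id_table[id] = s;
            return id;
        }
    }
    return -1;  /* table full */
}

/* Fill in every lock-free field first, and publish as the last step. */
static int create_sem_set(struct sem_set *s, int *storage, int nsems)
{
    assert(s != NULL && storage != NULL && nsems > 0);
    s->nsems = nsems;
    s->base = storage;
    return publish(s);
}
```

    Because publish() runs last, a reader that finds the slot never sees a
    half-initialized nsems or base.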

    Signed-off-by: Manfred Spraul
    Reported-by: Rik van Riel
    Acked-by: Rik van Riel
    Acked-by: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

20 Nov, 2014

1 commit

  • ... for situations when we don't have any candidate in pathnames - basically,
    in descriptor-based syscalls.

    [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

    Signed-off-by: Al Viro

    Al Viro
     

14 Oct, 2014

4 commits

  • Resolve some shadow warnings produced in W=2 builds by changing the name
    of some parameters and local variables. Change instances of "s64"
    because that clashes with the well-known typedef. Also change a local
    variable with the name "up" because that clashes with the name of the
    "up" function for semaphores. These are hazards so eliminate the
    hazards by renaming them.

    Signed-off-by: Mark Rustad
    Signed-off-by: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rustad
     
  • Using __seq_open_private() removes boilerplate code from
    sysvipc_proc_open().

    The resultant code is shorter and easier to follow.

    However, please note that __seq_open_private() calls kzalloc() rather than
    kmalloc() which may affect timing due to the memory initialisation
    overhead.

    Signed-off-by: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     
  • do_shmat() is the only user of ->start_stack (proc just reports its
    value), and this check looks ugly and wrong.

    The reason for this check is not clear at all, and it wrongly assumes that
    the stack can only grow down.

    But the main problem is that in general mm->start_stack has nothing to do
    with stack_vma->vm_start. Not only the application can switch to another
    stack and even unmap this area, setup_arg_pages() expands the stack
    without updating mm->start_stack during exec(). This means that in the
    likely case "addr > start_stack - size - PAGE_SIZE * 5" is simply
    impossible after find_vma_intersection() == F, or the stack can't grow
    anyway because of RLIMIT_STACK.

    Many thanks to Hugh for his explanations.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Cyrill Gorcunov
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • proc_dointvec_minmax() returns zero if a new value has been set. So we
    don't need to check all characters have been handled.

    Below you can find two examples. In the first, the new value has not been
    handled properly.

    $ strace ./a.out
    open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
    write(3, "0\n\0", 3) = 2
    close(3) = 0
    exit_group(0)
    $ cat /sys/kernel/debug/tracing/trace

    $ strace ./a.out
    open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
    write(3, "0\n", 2) = 2
    close(3) = 0

    $ cat /sys/kernel/debug/tracing/trace
    a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier

    Cc: Mathias Krause
    Cc: Manfred Spraul
    Cc: Joe Perches
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

09 Sep, 2014

1 commit


10 Aug, 2014

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a bunch of small changes built against 3.16-rc6. The most
    significant change for users is the first patch which makes setns
    dramatically faster by removing unneeded rcu handling.

    The next chunk of changes are so that "mount -o remount,.." will not
    allow the user namespace root to drop flags on a mount set by the
    system wide root. Aka this forces read-only mounts to stay read-only,
    no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
    mounts to stay no-exec and it prevents unprivileged users from messing
    with a mount's atime settings. I have included my test case as the
    last patch in this series so people performing backports can verify
    this change works correctly.

    The next change fixes a bug in NFS that was discovered while auditing
    nsproxy users for the first optimization. Today you can oops the
    kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
    with pid namespaces. I rebased and fixed the build of the
    !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
    that no one to my knowledge bases anything on my tree fixing the typo
    in place seems more responsible than requiring a typo-fix to be
    backported as well.

    The last change is a small semantic cleanup introducing
    /proc/thread-self and pointing /proc/mounts and /proc/net at it. This
    prevents several kinds of problematic corner cases. It is a
    user-visible change so it has a minute chance of causing regressions
    so the change to /proc/mounts and /proc/net are individual one line
    commits that can be trivially reverted. Unfortunately I lost and
    could not find the email of the original reporter so he is not
    credited. From at least one perspective this change to /proc/net is a
    regression fix to allow pthread /proc/net uses that were broken by
    the introduction of the network namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
    proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
    proc: Implement /proc/thread-self to point at the directory of the current thread
    proc: Have net show up under /proc/<pid>/task/<tid>
    NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
    mnt: Add tests for unprivileged remount cases that have found to be faulty
    mnt: Change the default remount atime from relatime to the existing value
    mnt: Correct permission checks in do_remount
    mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
    mnt: Only change user settable mount flags in remount
    namespaces: Use task_lock and not rcu to protect nsproxy

    Linus Torvalds
     

09 Aug, 2014

2 commits

    If shm_rmid_force is not set (the default state), then the shmids are only
    marked as orphaned, which does not require any add, delete, or locking of
    the tree structure.

    Separate the sysctl on and off case, and only obtain the read lock. The
    newly added list head can be deleted under the read lock because we are
    only called with current and will only change the semids allocated by this
    task and not manipulate the list.

    This commit assumes that up_read includes a sufficient memory barrier for
    the writes to be seen by others that later obtain a write lock.

    Signed-off-by: Milton Miller
    Signed-off-by: Jack Miller
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Miller
     
  • This is small set of patches our team has had kicking around for a few
    versions internally that fixes tasks getting hung on shm_exit when there
    are many threads hammering it at once.

    Anton wrote a simple test to cause the issue:

    http://ozlabs.org/~anton/junkcode/bust_shm_exit.c

    Before applying this patchset, this test code will cause either hanging
    tracebacks or pthread out of memory errors.

    After this patchset, it will still produce output like:

    root@somehost:~# ./bust_shm_exit 1024 160
    ...
    INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
    INFO: Stall ended before state dump start
    ...

    But the task will continue to run along happily, so we consider this an
    improvement over hanging, even if it's a bit noisy.

    This patch (of 3):

    exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
    walks every shared memory segment in the namespace. Thus the amount of
    work is related to the number of shm segments in the namespace not the
    number of segments that might need to be cleaned.

    In addition, this occurs after the task has been notified the thread has
    exited, so the number of tasks waiting for the ns shm rwsem can grow
    without bound until memory is exhausted.

    Add a list to the task struct of all shmids allocated by this task. Init
    the list head in copy_process. Use the ns->rwsem for locking. Add
    segments after id is added, remove before removing from id.

    On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
    to handling of semaphore undo.

    I chose a define for the init sequence since it's a simple list init,
    otherwise it would require a function call to avoid include loops between
    the semaphore code and the task struct. Converting the list_del to
    list_del_init for the unshare cases would remove the exit followed by
    init, but I left it to blow up if not inited.

    Signed-off-by: Milton Miller
    Signed-off-by: Jack Miller
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Miller
     

30 Jul, 2014

1 commit

    The synchronous synchronize_rcu in switch_task_namespaces makes setns
    a sufficiently expensive system call that people have complained.

    Upon inspection, nsproxy no longer needs rcu protection for remote reads.
    Remote reads are rare. So optimize for same-process reads and writes
    by switching to using task_lock instead.

    This yields a simpler to understand lock, and a faster setns system call.

    In particular this fixes a performance regression observed
    by Rafael David Tinoco.

    This is effectively a revert of Pavel Emelyanov's commit
    cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
    from 2007. The race this originally fixed no longer exists as
    do_notify_parent uses task_active_pid_ns(parent) instead of
    parent->nsproxy.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

07 Jun, 2014

16 commits

  • This typedef is unnecessary and should just be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
    always (since 0.99.10) reported a thread as sleeping on all semaphores
    that are listed in the semop() call.

    The documented behavior (both in the Linux man page and in the Single
    Unix Specification) is that a task should be reported on exactly one
    semaphore: The semaphore that caused the thread to go to sleep.

    This patch adds a pr_info_once() that is triggered if a thread hits the
    relevant case.

    The code triggers slightly too often, otherwise it would be necessary to
    replicate the old code. As there are no known users of GETNCNT or
    GETZCNT, this is done to prevent unnecessary bloat.

    The task that triggered is reported with name (tsk->comm) and pid.

    Signed-off-by: Manfred Spraul
    Acked-by: Davidlohr Bueso
    Cc: Michael Kerrisk
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
    waits on exactly one semaphore: The semaphore from the first operation
    in the sop array that cannot proceed.

    The Linux implementation never followed the standard, it tried to count
    all semaphores that might be the reason why a task sleeps.

    This patch fixes that.
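
    The SUSv4 rule can be sketched in user space as follows. This is an
    illustration under stated assumptions, not the kernel's
    perform_atomic_semop(): struct sop, MAX_SEMS and first_blocking_op() are
    hypothetical names, and undo handling and value limits are ignored. The
    function returns the index of the first operation that cannot proceed,
    which identifies the single semaphore the sleeper is counted on.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEMS 16  /* arbitrary bound chosen for this sketch */

/* Hypothetical, simplified form of struct sembuf. */
struct sop {
    int num;  /* semaphore index */
    int op;   /* > 0 increment, < 0 decrement, 0 wait-for-zero */
};

/*
 * Apply the operations in array order to a scratch copy of the values.
 * Returns the index of the first operation that cannot proceed, or -1
 * if the whole array can be applied. The caller's values are untouched.
 */
static int first_blocking_op(const int *vals, size_t nsems,
                             const struct sop *sops, size_t nsops)
{
    int scratch[MAX_SEMS];

    assert(nsems <= MAX_SEMS);
    for (size_t i = 0; i < nsems; i++)
        scratch[i] = vals[i];

    for (size_t i = 0; i < nsops; i++) {
        assert(sops[i].num >= 0 && (size_t)sops[i].num < nsems);
        int *v = &scratch[sops[i].num];

        if (sops[i].op == 0) {
            if (*v != 0)
                return (int)i;  /* sleeper counts toward semzcnt here */
        } else if (*v + sops[i].op < 0) {
            return (int)i;      /* sleeper counts toward semncnt here */
        } else {
            *v += sops[i].op;
        }
    }
    return -1;
}
```

    With values {1, 0, 2} and operations {0,-1}, {1,-1}, {2,0}, the second
    operation is the first to block, so the task is counted on semaphore 1
    only, even though the wait-for-zero on semaphore 2 would also block.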

    Note:
    a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
    therefore the code counts them only on demand.
    (If they wouldn't be rare, then the non-compliance would have
    been found earlier)

    b) compared to the initial version of the patch, the BUG_ONs were removed
    and it was clarified that the new behavior conforms to SUS.

    Back-compatibility concerns:

    Manfred:

    : - there is no application in Fedora that uses GETNCNT or GETZCNT.
    :
    : - applications that use only single-sop semop() are also safe, the
    : difference only affects complex apps.
    :
    : - portable applications are also safe, the new behavior is standard
    : compliant.
    :
    : But that's it. The old behavior existed in Linux from 0.99.something
    : until now.

    Michael:

    : * These operations seem to be very little used. Grepping the public
    : source that is contained in the Fedora 20 source DVD, there appear to be no
    : uses. Of course, this says nothing about uses in private /
    : non-mainstream FOSS code, but it seems likely that the same pattern
    : is followed there.
    :
    : * The existing behavior is hard enough to understand that I suspect
    : that no one understood it well enough to rely on it anyway
    : (especially as that behavior contradicted both man page and POSIX).
    :
    : So, there's a chance of breakage, but I estimate that it's minute.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Preparation for the next patch:

    In the slow-path of perform_atomic_semop(), store a pointer to the
    operation that caused the operation to block.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Right now, perform_atomic_semop gets the content of sem_queue as
    individual fields. Change that: instead, pass a pointer to sem_queue.

    This is a preparation for the next patch: it uses sem_queue to store the
    reason why a task must sleep.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
    count_semzcnt and count_semncnt are more or less identical. The patch
    creates a single function that either counts the number of tasks waiting
    for zero or waiting due to a decrease operation.

    Compared to the initial version, the BUG_ONs were removed.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • GETZCNT is supposed to return the number of threads that wait until a
    semaphore value becomes 0.

    The current implementation overlooks complex operations that contain
    both wait-for-zero operation and operations that alter at least one
    semaphore.

    The patch fixes that. It's intentionally copy&paste, this will be
    cleaned up in the next patch.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The need for volatile is not obvious, document it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Nothing big and no logical changes, just get rid of some redundant
    function declarations. Move msg_[init/exit]_ns down to the end of the
    file.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Call __set_current_state() instead of assigning the new state directly.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • SHMMAX is the upper limit for the size of a shared memory segment, counted
    in bytes. The actual allocation is that size, rounded up to the next full
    page.

    Add a check that prevents the creation of segments where the rounded up
    size causes an integer overflow.
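
    A minimal user-space sketch of such a check, assuming a 4096-byte page.
    SKETCH_PAGE_SIZE and size_to_pages() are hypothetical names, and the
    kernel's own test may be written differently; the point is only that the
    rounding must be rejected before it can wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)  /* assumed page size */

/*
 * Convert a byte size to a page count, rounding up.
 * Returns false if the rounding would wrap around SIZE_MAX.
 */
static bool size_to_pages(size_t size, size_t *pages)
{
    assert(pages != NULL);

    /* size + SKETCH_PAGE_SIZE - 1 would wrap for sizes near SIZE_MAX. */
    if (size > SIZE_MAX - (SKETCH_PAGE_SIZE - 1))
        return false;

    *pages = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    return true;
}
```

    Without the guard, a request of SIZE_MAX bytes would round up to zero
    pages and appear to succeed.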

    Signed-off-by: Manfred Spraul
    Acked-by: Davidlohr Bueso
    Acked-by: KOSAKI Motohiro
    Acked-by: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • shm_tot counts the total number of pages used by shm segments.

    If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can
    overflow. Subsequent calls to shmctl(,SHM_INFO,) would return wrong
    values for shm_tot.

    The patch adds a detection for overflows.
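
    The detection can be sketched like this. account_pages() and its
    parameters are hypothetical; the sketch relies only on the fact that
    unsigned arithmetic in C wraps, so a sum smaller than one of its operands
    means the addition overflowed.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Add pages to a running total.
 * Fails, leaving the total untouched, if the sum wraps or exceeds limit.
 */
static bool account_pages(unsigned long *tot, unsigned long pages,
                          unsigned long limit)
{
    assert(tot != NULL);

    unsigned long sum = *tot + pages;  /* unsigned wraparound is well defined */

    if (sum < *tot)   /* the addition wrapped */
        return false;
    if (sum > limit)  /* over the configured limit */
        return false;

    *tot = sum;
    return true;
}
```

    The wrap test matters precisely when the limit is ULONG_MAX, because in
    that case the second comparison can never fail on its own.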

    Signed-off-by: Manfred Spraul
    Acked-by: Davidlohr Bueso
    Acked-by: KOSAKI Motohiro
    Acked-by: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The increase of SHMMAX/SHMALL is a 4 patch series.

    The change itself is trivial, the only problems are integer overflows.
    The overflows are not new, but if we make huge values the default, then
    the code should be free from overflows.

    SHMMAX:

    - shmem_file_setup places a hard limit on the segment size:
    MAX_LFS_FILESIZE.

    On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are
    possible. Rounded up to full pages the actual allocated size
    is 0. --> must be fixed, patch 3

    - shmat:
    - find_vma_intersection does not handle overflows properly.
    --> must be fixed, patch 1

    - the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE
    and checks for overflows (i.e.: map 2 GB, starting from
    addr=2.5GB fails).

    SHMALL:
    - after creating 8192 segments size (1L< must be fixed, patch 2.

    Userspace:
    - Obviously, there could be overflows in userspace. There is nothing
    we can do, only use values smaller than ULONG_MAX.
    I ended with "ULONG_MAX - 1L<
    Acked-by: Davidlohr Bueso
    Acked-by: KOSAKI Motohiro
    Acked-by: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • trailing whitespace

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • Use #include instead of
    Use #include instead of

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • There is no need to recreate the very same ipc_ops structure on every
    kernel entry for msgget/semget/shmget. Just declare it static and be
    done with it. While at it, constify it as we don't modify the structure
    at runtime.
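
    The pattern looks roughly like the following user-space sketch. struct
    ops, struct new_params, create_obj() and get_obj() are hypothetical
    names, not the kernel's ipc_ops or ipcget() interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request parameters. */
struct new_params {
    int key;
    int flags;
};

/* Hypothetical operations table, in the spirit of ipc_ops. */
struct ops {
    int (*create)(const struct new_params *p);
};

static int create_obj(const struct new_params *p)
{
    return p->key + 1;  /* placeholder for real creation work */
}

static int get_obj(const struct new_params *p)
{
    /*
     * static const: initialized once at compile time and never written,
     * instead of being rebuilt on the stack on every call.
     */
    static const struct ops obj_ops = { .create = create_obj };

    assert(p != NULL);
    return obj_ops.create(p);
}
```

    Marking the table const also lets the compiler place it in read-only
    data, so the function pointers cannot be overwritten at runtime.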

    Found in the PaX patch, written by the PaX Team.

    Signed-off-by: Mathias Krause
    Cc: PaX Team
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     

08 Apr, 2014

2 commits

  • ... since __initcall is now deprecated.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This macro appears to have been introduced back in the 2.5 era for
    semtimedop32 backward compatibility on ia32:

    https://lkml.org/lkml/2003/4/28/78

    Nowadays, this syscall in compat just defaults back to the code found in
    sem.c, so it is no longer used and can thus be removed:

    long compat_sys_semtimedop(int semid, struct sembuf __user *tsems,
            unsigned nsops, const struct compat_timespec __user *timeout)
    {
        struct timespec __user *ts64;
        if (compat_convert_timespec(&ts64, timeout))
            return -EFAULT;
        return sys_semtimedop(semid, tsems, nsops, ts64);
    }

    Furthermore, there are no users in compat.c. After this change, kernel
    builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Apr, 2014

1 commit

  • Pull compat time conversion changes from Peter Anvin:
    "Despite the branch name this is really neither an x86 nor an
    x32-specific patchset, although it is the implementation of the
    discussions that followed the x32 security hole a few months ago.

    This removes get/put_compat_timespec/val() and replaces them with
    compat_get/put_timespec/val() which are savvy as to the current status
    of COMPAT_USE_64BIT_TIME.

    It removes several unused and/or incorrect/misleading functions (like
    compat_put_timeval_convert which doesn't in fact do any conversion)
    and also replaces several open-coded implementations of what is now
    called compat_convert_timespec() with that function"

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    compat: Fix sparse address space warnings
    compat: Get rid of (get|put)_compat_time(val|spec)

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull s390 compat wrapper rework from Heiko Carstens:
    "S390 compat system call wrapper simplification work.

    The intention of this work is to get rid of all hand written assembly
    compat system call wrappers on s390, which perform proper sign or zero
    extension, or pointer conversion of compat system call parameters.
    Instead all of this should be done with C code eg by using Al's
    COMPAT_SYSCALL_DEFINEx() macro.

    Therefore all common code and s390 specific compat system calls have
    been converted to the COMPAT_SYSCALL_DEFINEx() macro.

    In order to generate correct code all compat system calls may only
    have eg compat_ulong_t parameters, but no unsigned long parameters.
    Those patches which change parameter types from unsigned long to
    compat_ulong_t parameters are separate in this series, but shouldn't
    cause any harm.

    The only compat system calls which intentionally have 64 bit
    parameters (preadv64 and pwritev64) in support of the x86/32 ABI
    haven't been changed, but are now only available if an architecture
    defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

    System calls which do not have a compat variant but still need proper
    zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
    get a proper wrapper function with the new s390 specific
    COMPAT_SYSCALL_WRAPx() macro:

    COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

    which generates the following code (simplified):

    asmlinkage long sys_brk(unsigned long brk);
    asmlinkage long compat_sys_brk(long brk)
    {
        return sys_brk((u32)brk);
    }

    Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
    includes both linux/syscall.h and linux/compat.h, it will generate
    build errors, if the declaration of sys_brk() doesn't match, or if
    there exists a non-matching compat_sys_brk() declaration.

    In addition this will intentionally result in a link error if
    somewhere else a compat_sys_brk() function exists, which probably
    should have been used instead. Two more BUILD_BUG_ONs make sure the
    size and type of each compat syscall parameter can be handled
    correctly with the s390 specific macros.

    I converted the compat system calls step by step to verify the
    generated code is correct and matches the previous code. In fact it
    did not always match, however that was always a bug in the hand
    written asm code.

    In result we get less code, less bugs, and much more sanity checking"

    * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
    s390/compat: add copyright statement
    compat: include linux/unistd.h within linux/compat.h
    s390/compat: get rid of compat wrapper assembly code
    s390/compat: build error for large compat syscall args
    mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: convert to COMPAT_SYSCALL_DEFINE
    security/compat: convert to COMPAT_SYSCALL_DEFINE
    mm/compat: convert to COMPAT_SYSCALL_DEFINE
    net/compat: convert to COMPAT_SYSCALL_DEFINE
    kernel/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: optional preadv64/pwrite64 compat system calls
    ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
    s390/compat: partial parameter conversion within syscall wrappers
    s390/compat: automatic zero, sign and pointer conversion of syscalls
    s390/compat: add sync_file_range and fallocate compat syscalls
    ...

    Linus Torvalds
     

17 Mar, 2014

1 commit

  • While testing and documenting the msgrcv() MSG_COPY flag that Stanislav
    Kinsbursky added in commit 4a674f34ba04 ("ipc: introduce message queue
    copy feature" => kernel 3.8), I discovered a couple of bugs in the
    implementation. The two bugs concern MSG_COPY interactions with other
    msgrcv() flags, namely:

    (A) MSG_COPY + MSG_EXCEPT
    (B) MSG_COPY + !IPC_NOWAIT

    The bugs are distinct (and the fix for the first one is obvious),
    however my fix for both is a single-line patch, which is why I'm
    combining them in a single mail, rather than writing two mails+patches.

    ===== (A) MSG_COPY + MSG_EXCEPT =====

    With the addition of the MSG_COPY flag, there are now two msgrcv()
    flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp'
    argument in unrelated ways. Specifying both in the same call is a
    logical error that is currently permitted, with the effect that MSG_COPY
    has priority and MSG_EXCEPT is ignored. The call should give an error
    if both flags are specified. The patch below implements that behavior.

    ===== (B) MSG_COPY + !IPC_NOWAIT =====

    The test code that was submitted in commit 3a665531a3b7 ("selftests: IPC
    message queue copy feature test") shows MSG_COPY being used in
    conjunction with IPC_NOWAIT. In other words, if there is no message at
    the position 'msgtyp', return immediately with the error ENOMSG.

    What was not (fully) tested is the behavior if MSG_COPY is specified
    *without* IPC_NOWAIT, and there is an odd behavior. If the queue
    contains fewer than 'msgtyp' messages, then the call blocks until the
    next message is written to the queue. At that point, the msgrcv() call
    returns a copy of the newly added message, regardless of whether that
    message is at the ordinal position 'msgtyp'. This is clearly bogus, and
    problematic for applications that might want to make use of the MSG_COPY
    flag.

    I considered the following possible solutions to this problem:

    (1) Force the call to block until a message *does* appear at the
    position 'msgtyp'.

    (2) If the MSG_COPY flag is specified, the kernel should implicitly add
    IPC_NOWAIT, so that the call fails with ENOMSG for this case.

    (3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate
    an error (probably, EINVAL is the right one).

    I do not know if any application would really want to have the
    functionality of solution (1), especially since an application can
    determine in advance the number of messages in the queue using msgctl()
    IPC_STAT. Obviously, this solution would be the most work to implement.

    Solution (2) would have the effect of silently fixing any applications
    that tried to employ broken behavior. However, it would mean that if we
    later decided to implement solution (1), then user-space could not
    easily detect what the kernel supports (but, since I'm somewhat doubtful
    that solution (1) is needed, I'm not sure that this is much of a
    problem).

    Solution (3) would have the effect of informing broken applications that
    they are doing something broken. The downside is that this would cause
    an ABI breakage for any applications that are currently employing the
    broken behavior. However:

    a) Those applications are almost certainly not getting the results they
    expect.
    b) Possibly, those applications don't even exist, because MSG_COPY is
    currently hidden behind CONFIG_CHECKPOINT_RESTORE.

    The upside of solution (3) is that if we later decided to implement
    solution (1), user-space could determine what the kernel supports, via
    the error return.

    In my view, solution (3) is mildly preferable to solution (2), and
    solution (1) could still be done later if anyone really cares. The
    patch below implements solution (3).

    PS. For anyone out there still listening, it's the usual story:
    documenting an API (and the thinking about, and the testing of the API,
    that documentation entails) is one of the single best ways of
    finding bugs in the API, as I've learned from a lot of experience. Best
    to do that documentation before releasing the API.

    Signed-off-by: Michael Kerrisk
    Acked-by: Stanislav Kinsbursky
    Cc: Stanislav Kinsbursky
    Cc: stable@vger.kernel.org
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Michael Kerrisk
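
As an illustration of solution (3) in the commit above, the flag check can be modelled in a few lines. This is only a sketch of the rule as the commit message states it, not the kernel code: the function name `check_msg_copy_flags` is hypothetical, and the flag constants are the Linux values, written out here so the example is self-contained.

```python
import errno

# Linux flag values, written out as constants for this sketch.
IPC_NOWAIT = 0o4000
MSG_COPY = 0o40000


def check_msg_copy_flags(msgflg):
    """Model of solution (3): MSG_COPY without IPC_NOWAIT is rejected.

    Hypothetical helper; returns 0 if the flag combination is accepted,
    or a negative errno value in the style of a kernel return code.
    """
    if (msgflg & MSG_COPY) and not (msgflg & IPC_NOWAIT):
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    for flags in (0, IPC_NOWAIT, MSG_COPY, MSG_COPY | IPC_NOWAIT):
        print(oct(flags), check_msg_copy_flags(flags))
```

Because the rejected combination yields a distinct error, user space could later tell a kernel implementing solution (1) from one implementing solution (3) by checking whether the call still fails with EINVAL.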
     

06 Mar, 2014

3 commits


26 Feb, 2014

1 commit

  • Commit 93e6f119c0ce ("ipc/mqueue: cleanup definition names and
    locations") added global hardcoded limits to the number of message
    queues that can be created. While these limits are per-namespace,
    in practice they end up breaking userspace applications.
    Historically users have, at least in theory, been able to create up to
    INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
    for some workloads and use cases. For instance, Madars reports:

    "This update imposes bad limits on our multi-process application. As
    our app uses approaches that each process opens its own set of queues
    (usually something about 3-5 queues per process). In some scenarios
    we might run up to 3000 processes or more (which of-course for linux
    is not a problem). Thus we might need up to 9000 queues or more. All
    processes run under one user."

    Other affected users can be found in launchpad bug #1155695:
    https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

    Instead of increasing this limit, revert it entirely and fall back to the
    original way of dealing with queue limits -- where once a user's resource
    limit is reached, and all memory is used, new queues cannot be created.

    Signed-off-by: Davidlohr Bueso
    Reported-by: Madars Vitolins
    Acked-by: Doug Ledford
    Cc: Manfred Spraul
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Feb, 2014

1 commit

  • We have two APIs for compatibility timespec/val, with confusingly
    similar names. compat_(get|put)_time(val|spec) *do* handle the case
    where COMPAT_USE_64BIT_TIME is set, whereas
    (get|put)_compat_time(val|spec) do not. This is an accident waiting
    to happen.

    Clean it up by favoring the full-service version; the limited version
    is replaced with double-underscore versions static to kernel/compat.c.

    A common pattern is to convert a struct timespec to kernel format in
    an allocation on the user stack. Unfortunately it is open-coded in
    several places. Since this allocation isn't actually needed if
    COMPAT_USE_64BIT_TIME is true (since user format == kernel format)
    encapsulate that whole pattern into the function
    compat_convert_timespec(). An equivalent function should be written
    for struct timeval if it is needed in the future.

    Finally, get rid of compat_(get|put)_timeval_convert(): each was only
    used once, and the latter was not even doing what its name said
    (no conversion was actually being done). Moving the conversion into
    compat_sys_settimeofday() itself makes the code much more similar to
    sys_settimeofday().

    v3: Remove unused compat_convert_timeval().

    v2: Drop bogus "const" in the destination argument for
    compat_convert_time*().

    Cc: Mauro Carvalho Chehab
    Cc: Alexander Viro
    Cc: Hans Verkuil
    Cc: Andrew Morton
    Cc: Heiko Carstens
    Cc: Manfred Spraul
    Cc: Mateusz Guzik
    Cc: Rafael Aquini
    Cc: Davidlohr Bueso
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Catalin Marinas
    Cc: Will Deacon
    Tested-by: H.J. Lu
    Signed-off-by: H. Peter Anvin

    H. Peter Anvin
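
The conversion pattern described in the commit above can be modelled outside the kernel. The sketch below is not the kernel's compat_convert_timespec(): the name `model_convert_timespec` is hypothetical, and both layouts are assumptions made for illustration (a compat timespec as two little-endian 32-bit signed fields, a kernel timespec as two 64-bit signed fields).

```python
import struct

# Assumed layouts for this sketch only.
COMPAT_TIMESPEC = struct.Struct("<ii")  # 32-bit tv_sec, 32-bit tv_nsec
KERNEL_TIMESPEC = struct.Struct("<qq")  # 64-bit tv_sec, 64-bit tv_nsec


def model_convert_timespec(raw, use_64bit_time):
    """Return the timespec bytes in the assumed kernel layout.

    When use_64bit_time is true, the user format already equals the
    kernel format, so the bytes are passed through with no conversion.
    """
    if use_64bit_time:
        return raw
    tv_sec, tv_nsec = COMPAT_TIMESPEC.unpack(raw)
    # Sign-extend both fields into the wider layout.
    return KERNEL_TIMESPEC.pack(tv_sec, tv_nsec)


if __name__ == "__main__":
    compat = COMPAT_TIMESPEC.pack(-1, 500)
    print(KERNEL_TIMESPEC.unpack(model_convert_timespec(compat, False)))
```

Keeping the pass-through case inside the helper mirrors the point of the commit message: callers no longer open-code the conversion, and the extra copy is skipped when it is not needed.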
     

28 Jan, 2014

1 commit

  • The compat function takes the msgtyp argument as u32 and passes it down
    to do_msgrcv(), where it is cast to long. The sign is therefore lost,
    and a negative msgtyp becomes a big positive number instead.

    Cast the argument to a signed type before passing it down.

    Signed-off-by: Mateusz Guzik
    Reported-by: Gabriellla Schmidt
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
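
The sign loss described in the commit above can be reproduced with fixed-width integers. This is a sketch under stated assumptions, not the kernel code: the function names are hypothetical, and `c_int64` stands in for a 64-bit kernel `long`.

```python
import ctypes


def widen_unsigned(msgtyp):
    """Buggy path: the value is held as u32, then widened to a 64-bit long."""
    as_u32 = ctypes.c_uint32(msgtyp).value
    return ctypes.c_int64(as_u32).value


def widen_signed(msgtyp):
    """Fixed path: reinterpret the u32 as a signed 32-bit value first."""
    as_u32 = ctypes.c_uint32(msgtyp).value
    as_s32 = ctypes.c_int32(as_u32).value
    return ctypes.c_int64(as_s32).value


if __name__ == "__main__":
    print(widen_unsigned(-2))  # 4294967294
    print(widen_signed(-2))    # -2
```

Positive values pass through both paths unchanged, so only callers that pass a negative msgtyp see a difference.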