08 Apr, 2014

2 commits

  • ... since __initcall is now deprecated.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This macro appears to have been introduced back in the 2.5 era for
    semtimedop32 backward compatibility on ia32:

    https://lkml.org/lkml/2003/4/28/78

    Nowadays, this syscall in compat just defaults back to the code found in
    sem.c, so it is no longer used and can thus be removed:

    long compat_sys_semtimedop(int semid, struct sembuf __user *tsems,
            unsigned nsops, const struct compat_timespec __user *timeout)
    {
            struct timespec __user *ts64;
            if (compat_convert_timespec(&ts64, timeout))
                    return -EFAULT;
            return sys_semtimedop(semid, tsems, nsops, ts64);
    }

    Furthermore, there are no users in compat.c. After this change, kernel
    builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Apr, 2014

1 commit

  • Pull compat time conversion changes from Peter Anvin:
    "Despite the branch name this is really neither an x86 nor an
    x32-specific patchset, although it is the implementation of the
    discussions that followed the x32 security hole a few months ago.

    This removes get/put_compat_timespec/val() and replaces them with
    compat_get/put_timespec/val() which are savvy as to the current status
    of COMPAT_USE_64BIT_TIME.

    It removes several unused and/or incorrect/misleading functions (like
    compat_put_timeval_convert which doesn't in fact do any conversion)
    and also replaces several open-coded implementations of what is now
    called compat_convert_timespec() with that function"

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    compat: Fix sparse address space warnings
    compat: Get rid of (get|put)_compat_time(val|spec)

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull s390 compat wrapper rework from Heiko Carstens:
    "S390 compat system call wrapper simplification work.

    The intention of this work is to get rid of all hand written assembly
    compat system call wrappers on s390, which perform proper sign or zero
    extension, or pointer conversion of compat system call parameters.
    Instead all of this should be done with C code eg by using Al's
    COMPAT_SYSCALL_DEFINEx() macro.

    Therefore all common code and s390 specific compat system calls have
    been converted to the COMPAT_SYSCALL_DEFINEx() macro.

    In order to generate correct code all compat system calls may only
    have eg compat_ulong_t parameters, but no unsigned long parameters.
    Those patches which change parameter types from unsigned long to
    compat_ulong_t parameters are separate in this series, but shouldn't
    cause any harm.

    The only compat system calls which intentionally have 64 bit
    parameters (preadv64 and pwritev64) in support of the x86/32 ABI
    haven't been changed, but are now only available if an architecture
    defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

    System calls which do not have a compat variant but still need proper
    zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
    get a proper wrapper function with the new s390 specific
    COMPAT_SYSCALL_WRAPx() macro:

    COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

    which generates the following code (simplified):

    asmlinkage long sys_brk(unsigned long brk);
    asmlinkage long compat_sys_brk(long brk)
    {
            return sys_brk((u32)brk);
    }

    Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
    includes both linux/syscalls.h and linux/compat.h, it will generate
    build errors, if the declaration of sys_brk() doesn't match, or if
    there exists a non-matching compat_sys_brk() declaration.

    In addition this will intentionally result in a link error if
    somewhere else a compat_sys_brk() function exists, which probably
    should have been used instead. Two more BUILD_BUG_ONs make sure the
    size and type of each compat syscall parameter can be handled
    correctly with the s390 specific macros.
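
    The zero-extension step described above can be imitated in user space.
    This is only a sketch: SKETCH_WRAP1, native_brk() and compat_brk() are
    hypothetical names, uint32_t stands in for the kernel's u32, and the
    macro merely mimics the simplified expansion quoted above rather than
    the real s390 COMPAT_SYSCALL_WRAPx() implementation.

```c
#include <stdint.h>

/* Hypothetical stand-in for the native syscall: it just echoes its argument. */
static long native_brk(unsigned long brk)
{
    return (long)brk;
}

/*
 * Hypothetical user-space imitation of a one-argument wrapper macro.
 * It defines compat_<name>(), which keeps only the low 32 bits of its
 * argument (zero extension) before calling native_<name>().
 */
#define SKETCH_WRAP1(name, type, arg)                      \
    static long compat_##name(long arg)                    \
    {                                                      \
        return native_##name((type)(uint32_t)arg);         \
    }

/* Expands to: static long compat_brk(long brk) { ... } */
SKETCH_WRAP1(brk, unsigned long, brk)
```

    With a 64-bit long, compat_brk(-1L) passes 0xffffffff to the native
    function instead of an all-ones 64-bit value, which is the effect the
    (u32) cast has in the generated wrapper.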

    I converted the compat system calls step by step to verify the
    generated code is correct and matches the previous code. In fact it
    did not always match, however that was always a bug in the hand
    written asm code.

    In result we get less code, less bugs, and much more sanity checking"

    * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
    s390/compat: add copyright statement
    compat: include linux/unistd.h within linux/compat.h
    s390/compat: get rid of compat wrapper assembly code
    s390/compat: build error for large compat syscall args
    mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
    ipc/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: convert to COMPAT_SYSCALL_DEFINE
    security/compat: convert to COMPAT_SYSCALL_DEFINE
    mm/compat: convert to COMPAT_SYSCALL_DEFINE
    net/compat: convert to COMPAT_SYSCALL_DEFINE
    kernel/compat: convert to COMPAT_SYSCALL_DEFINE
    fs/compat: optional preadv64/pwrite64 compat system calls
    ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
    s390/compat: partial parameter conversion within syscall wrappers
    s390/compat: automatic zero, sign and pointer conversion of syscalls
    s390/compat: add sync_file_range and fallocate compat syscalls
    ...

    Linus Torvalds
     

17 Mar, 2014

1 commit

  • While testing and documenting the msgrcv() MSG_COPY flag that Stanislav
    Kinsbursky added in commit 4a674f34ba04 ("ipc: introduce message queue
    copy feature" => kernel 3.8), I discovered a couple of bugs in the
    implementation. The two bugs concern MSG_COPY interactions with other
    msgrcv() flags, namely:

    (A) MSG_COPY + MSG_EXCEPT
    (B) MSG_COPY + !IPC_NOWAIT

    The bugs are distinct (and the fix for the first one is obvious),
    however my fix for both is a single-line patch, which is why I'm
    combining them in a single mail, rather than writing two mails+patches.

    ===== (A) MSG_COPY + MSG_EXCEPT =====

    With the addition of the MSG_COPY flag, there are now two msgrcv()
    flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp'
    argument in unrelated ways. Specifying both in the same call is a
    logical error that is currently permitted, with the effect that MSG_COPY
    has priority and MSG_EXCEPT is ignored. The call should give an error
    if both flags are specified. The patch below implements that behavior.

    ===== (B) MSG_COPY + !IPC_NOWAIT =====

    The test code that was submitted in commit 3a665531a3b7 ("selftests: IPC
    message queue copy feature test") shows MSG_COPY being used in
    conjunction with IPC_NOWAIT. In other words, if there is no message at
    the position 'msgtyp', return immediately with the error ENOMSG.

    What was not (fully) tested is the behavior if MSG_COPY is specified
    *without* IPC_NOWAIT, and there is an odd behavior. If the queue
    contains fewer than 'msgtyp' messages, then the call blocks until the
    next message is written to the queue. At that point, the msgrcv() call
    returns a copy of the newly added message, regardless of whether that
    message is at the ordinal position 'msgtyp'. This is clearly bogus, and
    problematic for applications that might want to make use of the MSG_COPY
    flag.

    I considered the following possible solutions to this problem:

    (1) Force the call to block until a message *does* appear at the
    position 'msgtyp'.

    (2) If the MSG_COPY flag is specified, the kernel should implicitly add
    IPC_NOWAIT, so that the call fails with ENOMSG for this case.

    (3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate
    an error (probably, EINVAL is the right one).

    I do not know if any application would really want to have the
    functionality of solution (1), especially since an application can
    determine in advance the number of messages in the queue using msgctl()
    IPC_STAT. Obviously, this solution would be the most work to implement.

    Solution (2) would have the effect of silently fixing any applications
    that tried to employ broken behavior. However, it would mean that if we
    later decided to implement solution (1), then user-space could not
    easily detect what the kernel supports (but, since I'm somewhat doubtful
    that solution (1) is needed, I'm not sure that this is much of a
    problem).

    Solution (3) would have the effect of informing broken applications that
    they are doing something broken. The downside is that this would cause
    an ABI breakage for any applications that are currently employing the
    broken behavior. However:

    a) Those applications are almost certainly not getting the results they
    expect.
    b) Possibly, those applications don't even exist, because MSG_COPY is
    currently hidden behind CONFIG_CHECKPOINT_RESTORE.

    The upside of solution (3) is that if we later decided to implement
    solution (1), user-space could determine what the kernel supports, via
    the error return.

    In my view, solution (3) is mildly preferable to solution (2), and
    solution (1) could still be done later if anyone really cares. The
    patch below implements solution (3).
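
    The combined rule from (A) and solution (3) can be sketched as a small
    flag check. This is an illustration and not the patch itself:
    check_msg_copy_flags() is a hypothetical helper, and the fallback
    octal values are the Linux ones, used only if the libc headers do not
    expose MSG_EXCEPT or MSG_COPY.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* asks glibc to expose MSG_EXCEPT / MSG_COPY */
#endif
#include <errno.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Fallbacks (Linux values) in case the headers above do not define them. */
#ifndef MSG_EXCEPT
#define MSG_EXCEPT 020000
#endif
#ifndef MSG_COPY
#define MSG_COPY 040000
#endif

/*
 * Hypothetical helper: returns 0 if the flag combination is acceptable,
 * or -EINVAL if MSG_COPY is combined with MSG_EXCEPT (case A) or is
 * given without IPC_NOWAIT (case B, solution 3).
 */
static int check_msg_copy_flags(int msgflg)
{
    if (msgflg & MSG_COPY) {
        if ((msgflg & MSG_EXCEPT) || !(msgflg & IPC_NOWAIT))
            return -EINVAL;
    }
    return 0;
}
```

    Under this rule MSG_COPY | IPC_NOWAIT is the only accepted way to use
    MSG_COPY, so an application can tell from the EINVAL return that the
    blocking variant is not supported.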

    PS. For anyone out there still listening, it's the usual story:
    documenting an API (and the thinking about, and the testing of the API,
    that documentation entails) is one of the single best ways of
    finding bugs in the API, as I've learned from a lot of experience. Best
    to do that documentation before releasing the API.

    Signed-off-by: Michael Kerrisk
    Acked-by: Stanislav Kinsbursky
    Cc: Stanislav Kinsbursky
    Cc: stable@vger.kernel.org
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Michael Kerrisk
     

06 Mar, 2014

3 commits


26 Feb, 2014

1 commit

  • Commit 93e6f119c0ce ("ipc/mqueue: cleanup definition names and
    locations") added global hardcoded limits to the number of message
    queues that can be created. While these limits are per-namespace,
    in practice they end up breaking userspace applications.
    Historically users have, at least in theory, been able to create up to
    INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
    for some workloads and use cases. For instance, Madars reports:

    "This update imposes bad limits on our multi-process application. As
    our app uses approaches that each process opens its own set of queues
    (usually something about 3-5 queues per process). In some scenarios
    we might run up to 3000 processes or more (which of-course for linux
    is not a problem). Thus we might need up to 9000 queues or more. All
    processes run under one user."

    Other affected users can be found in launchpad bug #1155695:
    https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

    Instead of increasing this limit, revert it entirely and fall back to the
    original way of dealing with queue limits -- where once a user's resource
    limit is reached, and all memory is used, new queues cannot be created.

    Signed-off-by: Davidlohr Bueso
    Reported-by: Madars Vitolins
    Acked-by: Doug Ledford
    Cc: Manfred Spraul
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Feb, 2014

1 commit

  • We have two APIs for compatibility timespec/val, with confusingly
    similar names. compat_(get|put)_time(val|spec) *do* handle the case
    where COMPAT_USE_64BIT_TIME is set, whereas
    (get|put)_compat_time(val|spec) do not. This is an accident waiting
    to happen.

    Clean it up by favoring the full-service version; the limited version
    is replaced with double-underscore versions static to kernel/compat.c.

    A common pattern is to convert a struct timespec to kernel format in
    an allocation on the user stack. Unfortunately it is open-coded in
    several places. Since this allocation isn't actually needed if
    COMPAT_USE_64BIT_TIME is true (since user format == kernel format)
    encapsulate that whole pattern into the function
    compat_convert_timespec(). An equivalent function should be written
    for struct timeval if it is needed in the future.
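
    The pattern being encapsulated can be sketched in user space as
    follows. Everything here is hypothetical: the struct names, the
    same_layout flag (standing in for COMPAT_USE_64BIT_TIME) and the
    caller-supplied scratch buffer (standing in for the allocation on the
    user stack). It is not the kernel's compat_convert_timespec().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit and 64-bit timespec layouts. */
struct ts32_sketch { int32_t tv_sec; int32_t tv_nsec; };
struct ts64_sketch { int64_t tv_sec; int64_t tv_nsec; };

/*
 * Hypothetical helper: make *out point at a 64-bit timespec.
 *  - NULL timeout: passed through unchanged.
 *  - same_layout:  the caller's buffer already has the native layout,
 *                  so no scratch copy is needed.
 *  - otherwise:    widen the 32-bit fields into the scratch buffer.
 */
static int convert_timespec_sketch(const struct ts64_sketch **out,
                                   const void *timeout, bool same_layout,
                                   struct ts64_sketch *scratch)
{
    if (timeout == NULL || same_layout) {
        *out = timeout;
        return 0;
    }
    if (scratch == NULL)
        return -EFAULT;

    const struct ts32_sketch *in = timeout;
    scratch->tv_sec = in->tv_sec;       /* sign-extends to 64 bits */
    scratch->tv_nsec = in->tv_nsec;
    *out = scratch;
    return 0;
}
```

    The point of the sketch is the early return: when the layouts match,
    the conversion and the scratch allocation are skipped entirely.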

    Finally, get rid of compat_(get|put)_timeval_convert(): each was only
    used once, and the latter was not even doing what the function said
    (no conversion actually was being done.) Moving the conversion into
    compat_sys_settimeofday() itself makes the code much more similar to
    sys_settimeofday() itself.

    v3: Remove unused compat_convert_timeval().

    v2: Drop bogus "const" in the destination argument for
    compat_convert_time*().

    Cc: Mauro Carvalho Chehab
    Cc: Alexander Viro
    Cc: Hans Verkuil
    Cc: Andrew Morton
    Cc: Heiko Carstens
    Cc: Manfred Spraul
    Cc: Mateusz Guzik
    Cc: Rafael Aquini
    Cc: Davidlohr Bueso
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Catalin Marinas
    Cc: Will Deacon
    Tested-by: H.J. Lu
    Signed-off-by: H. Peter Anvin

    H. Peter Anvin
     

28 Jan, 2014

11 commits

  • Compat function takes msgtyp argument as u32 and passes it down to
    do_msgrcv which results in casting to long, thus the sign is lost and we
    get a big positive number instead.

    Cast the argument to signed type before passing it down.
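
    A minimal user-space sketch of the sign loss, assuming a 64-bit long.
    The function names are hypothetical, and uint32_t/int32_t stand in for
    the kernel's compat types.

```c
#include <stdint.h>

/* Widening as described in the bug: the unsigned 32-bit value is
 * zero-extended, so a negative msgtyp becomes a large positive number. */
static long msgtyp_zero_extended(uint32_t compat_msgtyp)
{
    return (long)compat_msgtyp;
}

/* Widening with the fix: reinterpret as signed 32-bit first, so the
 * sign survives the conversion to long. (The unsigned-to-signed cast is
 * implementation-defined in C; mainstream compilers wrap as expected.) */
static long msgtyp_sign_extended(uint32_t compat_msgtyp)
{
    return (long)(int32_t)compat_msgtyp;
}
```

    For a caller that passed msgtyp = -2, the first function yields
    4294967294 on a 64-bit kernel, while the second yields -2.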

    Signed-off-by: Mateusz Guzik
    Reported-by: Gabriellla Schmidt
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     
  • Both expunge_all() and pipeline_send() rely on both a nil msg value and
    a full barrier to guarantee the correct ordering when waking up a task.

    While its counterpart at the receiving end is well documented for the
    lockless recv algorithm, we still need to document these specific
    smp_mb() calls.

    [akpm@linux-foundation.org: fix typo, per Mike]
    [akpm@linux-foundation.org: mroe tpyos]
    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This field is only used to reset the ids seq number if it exceeds the
    smaller of INT_MAX/SEQ_MULTIPLIER and USHRT_MAX, and can therefore be
    moved out of the structure and into its own macro. Since each
    ipc_namespace contains a table of 3 pointers to struct ipc_ids we can
    save space in instruction text:

       text    data     bss     dec     hex filename
      56232    2348      24   58604    e4ec ipc/built-in.o
      56216    2348      24   58588    e4dc ipc/built-in.o-after
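
    A sketch of what such a macro could look like. The names are
    hypothetical, and the multiplier value of 32768 is an assumption made
    for illustration only; it is not stated in this commit message.

```c
#include <limits.h>

/* Assumed value, for illustration only. */
#define SEQ_MULTIPLIER_SKETCH 32768

/* The smaller of INT_MAX/SEQ_MULTIPLIER and USHRT_MAX, as a constant
 * expression instead of a per-structure field. */
#define SEQ_MAX_SKETCH                                         \
    ((INT_MAX / SEQ_MULTIPLIER_SKETCH) < USHRT_MAX             \
         ? (INT_MAX / SEQ_MULTIPLIER_SKETCH)                   \
         : USHRT_MAX)

/* Hypothetical helper: advance the sequence number and reset it to 0
 * once it exceeds the limit. */
static int next_seq_sketch(int seq)
{
    seq++;
    if (seq > SEQ_MAX_SKETCH)
        seq = 0;
    return seq;
}
```

    Because the limit is a compile-time constant, no storage is needed
    for it in each structure.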

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Jonathan Gonzalez
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Get rid of silly/useless label jumping.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Only found in ipc_rmid().

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Deal with checkpatch messages:
    WARNING: braces {} are not necessary for single statement blocks

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • IPC commenting style is all over the place, *especially* in util.c. This
    patch orders things a bit.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ipc code does not adhere to the typical Linux coding style.
    This patch fixes lots of simple whitespace errors.

    - mostly autogenerated by
    scripts/checkpatch.pl -f --fix \
    --types=pointer_location,spacing,space_before_tab
    - one manual fixup (keep structure members tab-aligned)
    - removal of additional space_before_tab that were not found by --fix

    Tested with some of my msg and sem test apps.

    Andrew: Could you include it in -mm and move it towards Linus' tree?

    Signed-off-by: Manfred Spraul
    Suggested-by: Li Bin
    Cc: Joe Perches
    Acked-by: Rafael Aquini
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • struct kern_ipc_perm.deleted is meant to be used as a boolean toggle, and
    the changes introduced by this patch are just to make the case explicit.

    Signed-off-by: Rafael Aquini
    Reviewed-by: Rik van Riel
    Cc: Greg Thelen
    Acked-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • After the locking semantics for the SysV IPC API got improved, a couple
    of IPC_RMID race windows were opened because we ended up dropping the
    'kern_ipc_perm.deleted' check performed way down in ipc_lock(). The
    spotted races got sorted out by re-introducing the old test within the
    racy critical sections.

    This patch introduces ipc_valid_object() to consolidate the way we cope
    with IPC_RMID races by using the same abstraction across the API
    implementation.
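
    A user-space sketch of the abstraction, under stated assumptions:
    perm_sketch is a hypothetical stand-in for struct kern_ipc_perm, and
    the caller is assumed to already hold the object's lock. The actual
    kernel helper may differ in detail.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the permission structure. */
struct perm_sketch {
    bool deleted;   /* set once IPC_RMID has removed the object */
    int value;      /* some payload protected by the object's lock */
};

/* One shared predicate instead of open-coded '.deleted' tests. */
static bool valid_object_sketch(const struct perm_sketch *perm)
{
    return !perm->deleted;
}

/*
 * Hypothetical caller. The object's lock is assumed to be held here,
 * so the validity test and the access cannot race with removal.
 */
static int read_value_locked(const struct perm_sketch *perm, int *out)
{
    if (!valid_object_sketch(perm))
        return -EIDRM;
    *out = perm->value;
    return 0;
}
```

    Each racy critical section then performs the same test right after
    taking the lock, which makes missing checks easier to spot in review.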

    Signed-off-by: Rafael Aquini
    Acked-by: Rik van Riel
    Acked-by: Greg Thelen
    Reviewed-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • When trying to understand semop code, I found a small mistake in the check
    for semadj (undo) value overflow. The new undo value is not stored
    immediately and next potential checks are done against the old value.

    The failing scenario is not very practical. A single semop call has to
    perform several operations on the same semaphore. Also, semval and
    semadj must have different values, so there have to be some operations
    without the SEM_UNDO flag. For example:

    struct sembuf depositor_op[1];
    struct sembuf collector_op[2];

    depositor_op[0].sem_num = 0;
    depositor_op[0].sem_op = 20000;
    depositor_op[0].sem_flg = 0;

    collector_op[0].sem_num = 0;
    collector_op[0].sem_op = -10000;
    collector_op[0].sem_flg = SEM_UNDO;
    collector_op[1].sem_num = 0;
    collector_op[1].sem_op = -10000;
    collector_op[1].sem_flg = SEM_UNDO;

    if (semop(semid, depositor_op, 1) == -1)
        { perror("Failed to do 1st deposit"); return 1; }

    if (semop(semid, collector_op, 2) == -1)
        { perror("Failed to do 1st collect"); return 1; }

    if (semop(semid, depositor_op, 1) == -1)
        { perror("Failed to do 2nd deposit"); return 1; }

    if (semop(semid, collector_op, 2) == -1)
        { perror("Failed to do 2nd collect"); return 1; }

    return 0;

    It passes without error now but the semadj value has overflown in the 2nd
    collector operation.
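
    The mistake can be modelled outside the kernel. In this sketch the
    function names are hypothetical and the limit of 32767 for the undo
    value is an assumption; only the range check is modelled, not the
    semaphore update itself.

```c
#include <errno.h>

/* Assumed upper bound for the undo (semadj) value. */
#define SEMAEM_SKETCH 32767

static int undo_in_range(int undo)
{
    return undo >= -SEMAEM_SKETCH - 1 && undo <= SEMAEM_SKETCH;
}

/* Behaviour described as buggy: every operation in the call is checked
 * against the semadj value from before the call. */
static int check_with_stale_semadj(int semadj, const short *sem_ops, int nsops)
{
    for (int i = 0; i < nsops; i++) {
        if (!undo_in_range(semadj - sem_ops[i]))
            return -ERANGE;
    }
    return 0;
}

/* Behaviour after the fix: the new undo value is stored immediately, so
 * later operations in the same call are checked against it. */
static int check_with_updated_semadj(int semadj, const short *sem_ops, int nsops)
{
    for (int i = 0; i < nsops; i++) {
        int undo = semadj - sem_ops[i];

        if (!undo_in_range(undo))
            return -ERANGE;
        semadj = undo;
    }
    return 0;
}
```

    In the example above, semadj is 20000 when the second collect starts.
    Each -10000 operation alone gives 30000, which is in range, but the
    two together give 40000, which only the second function rejects.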

    [akpm@linux-foundation.org: restore lessened scope of local `undo']
    [davidlohr@hp.com: correct header comment for perform_atomic_semop]
    Signed-off-by: Petr Mladek
    Acked-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Cc: Jiri Kosina
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

22 Nov, 2013

2 commits

  • Commit 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
    restructured the ipc shm to shorten critical region, but introduced a
    path where the return value could be -EPERM, even if the operation
    actually was performed.

    Before the commit, the err return value was reset by the return value
    from security_shm_shmctl() after the if (!ns_capable(...)) statement.

    Now, we still exit the if statement with err set to -EPERM, and in the
    case of SHM_UNLOCK, it is not reset at all, and used as the return value
    from shmctl.

    To fix this, we only set err when errors occur, leaving the fallthrough
    case alone.

    Signed-off-by: Jesper Nilsson
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Al Viro
    Cc: [3.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Nilsson
     
  • When IPC_RMID races with other shm operations there's potential for
    use-after-free of the shm object's associated file (shm_file).

    Here's the race before this patch:

    TASK 1                     TASK 2
    ------                     ------
    shm_rmid()
      ipc_lock_object()
                               shmctl()
                               shp = shm_obtain_object_check()

      shm_destroy()
        shm_unlock()
        fput(shp->shm_file)
                               ipc_lock_object()
                               shmem_lock(shp->shm_file)

    The oops is caused because shm_destroy() calls fput() after dropping the
    ipc_lock. fput() clears the file's f_inode, f_path.dentry, and
    f_path.mnt, which causes various NULL pointer references in task 2. I
    reliably see the oops in task 2 if with shmlock, shmu

    This patch fixes the races by:
    1) setting shm_file=NULL in shm_destroy() while holding ipc_lock_object().
    2) modifying at-risk operations to check shm_file while holding
    ipc_lock_object().

    Example workloads, which each trigger oops...

    Workload 1:
    while true; do
        id=$(shmget 1 4096)
        shm_rmid $id &
        shmlock $id &
        wait
    done

    The oops stack shows accessing NULL f_inode due to racing fput:
    _raw_spin_lock
    shmem_lock
    SyS_shmctl

    Workload 2:
    while true; do
        id=$(shmget 1 4096)
        shmat $id 4096 &
        shm_rmid $id &
        wait
    done

    The oops stack is similar to workload 1 due to NULL f_inode:
    touch_atime
    shmem_mmap
    shm_mmap
    mmap_region
    do_mmap_pgoff
    do_shmat
    SyS_shmat

    Workload 3:
    while true; do
        id=$(shmget 1 4096)
        shmlock $id
        shm_rmid $id &
        shmunlock $id &
        wait
    done

    The oops stack shows a second fput() tripping on a NULL f_inode. The
    first fput() completed via shm_destroy(), but a racing thread did
    a get_file() and queued this fput():
    locks_remove_flock
    __fput
    ____fput
    task_work_run
    do_notify_resume
    int_signal

    Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat")
    Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
    Signed-off-by: Greg Thelen
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Cc: # 3.10.17+ 3.11.6+
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

13 Nov, 2013

4 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton: (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.cuse dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     
  • On 64 bit systems the test for negative message sizes is bogus as the
    size, which may be positive when evaluated as a long, will get truncated
    to an int when passed to load_msg(). So a long might very well contain a
    positive value but when truncated to an int it would become negative.

    That in combination with a small negative value of msg_ctlmax (which will
    be promoted to an unsigned type for the comparison against msgsz, making
    it a big positive value and therefore make it pass the check) will lead to
    two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
    small buffer as the addition of alen is effectively a subtraction. 2/ The
    copy_from_user() call in load_msg() will first overflow the buffer with
    userland data and then, when the userland access generates an access
    violation, the fixup handler copy_user_handle_tail() will try to fill the
    remainder with zeros -- roughly 4GB. That almost instantly results in a
    system crash or reset.

    ,-[ Reproducer (needs to be run as root) ]--
    | #include <sys/stat.h>
    | #include <sys/msg.h>
    | #include <unistd.h>
    | #include <fcntl.h>
    |
    | int main(void) {
    |     long msg = 1;
    |     int fd;
    |
    |     fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
    |     write(fd, "-1", 2);
    |     close(fd);
    |
    |     msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
    |
    |     return 0;
    | }
    '---

    Fix the issue by preventing msgsz from getting truncated by consistently
    using size_t for the message length. This way the size checks in
    do_msgsnd() could still be passed with a negative value for msg_ctlmax but
    we would fail on the buffer allocation in that case and error out.

    Also change the type of m_ts from int to size_t to avoid similar nastiness
    in other code paths -- it is used in similar constructs, i.e. signed vs.
    unsigned checks. It should never become negative under normal
    circumstances, though.
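
    The two conversions involved can be shown in isolation. This is a
    user-space sketch with hypothetical function names, not the kernel
    code; the size_t-to-int cast is implementation-defined in C and
    behaves as described on mainstream compilers.

```c
#include <stddef.h>

/*
 * Sketch of a size check against a signed limit. The limit is converted
 * to an unsigned type for the comparison, so a limit of -1 becomes the
 * largest possible size and every msgsz passes.
 */
static int size_check_passes(size_t msgsz, int ctlmax)
{
    return !(msgsz > (size_t)ctlmax);
}

/* Sketch of the truncation that happened when the length was handed on
 * as an int: a large positive size can turn negative. */
static int truncated_len(size_t msgsz)
{
    return (int)msgsz;
}
```

    With the reproducer's values, size_check_passes(0xfffffff0, -1) is
    true and truncated_len(0xfffffff0) is -16 on typical platforms, which
    is the combination that leads to the undersized allocation. Keeping
    the length as size_t throughout removes the second step.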

    Setting msg_ctlmax to a negative value is an odd configuration and should
    be prevented. As that might break existing userland, it will be handled
    in a separate commit so it could easily be reverted and reworked without
    reintroducing the above described bug.

    Hardening mechanisms for user copy operations would have caught that bug
    early -- e.g. checking slab object sizes on user copy operations as the
    usercopy feature of the PaX patch does. Or, for that matter, detect the
    long vs. int sign change due to truncation, as the size overflow plugin
    of the very same patch does.

    [akpm@linux-foundation.org: fix i386 min() warnings]
    Signed-off-by: Mathias Krause
    Cc: Pax Team
    Cc: Davidlohr Bueso
    Cc: Brad Spengler
    Cc: Manfred Spraul
    Cc: [ v2.3.27+ -- yes, that old ;) ]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • Remove unnecessary work pending test before calling schedule_work(). It
    has been tested in queue_work_on() already. No functional change.

    Signed-off-by: Xie XiuQi
    Cc: Tejun Heo
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     

09 Nov, 2013

1 commit

  • We need to break delegations on any operation that changes the set of
    links pointing to an inode. Start with unlink.

    Such operations also hold the i_mutex on a parent directory. Breaking a
    delegation may require waiting for a timeout (by default 90 seconds) in
    the case of an unresponsive NFS client. To avoid blocking all directory
    operations, we therefore drop locks before waiting for the delegation.
    The logic then looks like:

    acquire locks
    ...
    test for delegation; if found:
        take reference on inode
        release locks
        wait for delegation break
        drop reference on inode
        retry

    It is possible this could never terminate. (Even if we take precautions
    to prevent another delegation being acquired on the same inode, we could
    get a different inode on each retry.) But this seems very unlikely.

    The initial test for a delegation happens after the lock on the target
    inode is acquired, but the directory inode may have been acquired
    further up the call stack. We therefore add a "struct inode **"
    argument to any intervening functions, which we use to pass the inode
    back up to the caller in the case it needs a delegation synchronously
    broken.

    Cc: David Howells
    Cc: Tyler Hicks
    Cc: Dustin Kirkland
    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

04 Nov, 2013

1 commit

  • Negative message lengths make no sense -- so don't do negative queue
    lengths or identifier counts. Prevent them from getting negative.

    Also change the underlying data types to be unsigned to avoid hairy
    surprises with sign extensions in cases where those variables get
    evaluated in unsigned expressions with bigger data types, e.g size_t.

    In case a user still wants to have "unlimited" sizes she could just use
    INT_MAX instead.

    Signed-off-by: Mathias Krause
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     

17 Oct, 2013

2 commits

  • After acquiring the semlock spinlock, operations must test that the
    array is still valid.

    - semctl() and exit_sem() would walk stale linked lists (ugly, but
    should be ok: all lists are empty)

    - semtimedop() would sleep forever - and if woken up due to a signal -
    access memory after free.

    The patch also:
    - standardizes the tests for .deleted, so that all tests in one
    function leave the function with the same approach.
    - unconditionally tests for .deleted immediately after every call to
    sem_lock - even if it means that for semctl(GETALL), .deleted will be
    tested twice.

    Both changes make the review simpler: After every sem_lock, there must
    be a test of .deleted, followed by a goto to the cleanup code (if the
    function uses "goto cleanup").

    The only exception is semctl_down(): If sem_ids().rwsem is locked, then
    the presence in ids->ipcs_idr is equivalent to !.deleted, thus no
    additional test is required.
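
    The review rule above (lock, test .deleted, goto cleanup) can be
    sketched as follows. This is a single-threaded illustration with a
    boolean standing in for the spinlock; struct sem_array_sketch,
    sketch_lock(), sketch_unlock() and sketch_getval() are hypothetical
    names, and returning -EIDRM for a removed array is an assumption of the
    sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sem_array_sketch {
    bool locked;    /* stands in for the semlock spinlock */
    bool deleted;   /* set once the array has been removed */
    int value;
};

static void sketch_lock(struct sem_array_sketch *sma)   { sma->locked = true; }
static void sketch_unlock(struct sem_array_sketch *sma) { sma->locked = false; }

static int sketch_getval(struct sem_array_sketch *sma, int *out)
{
    int err;

    assert(out != NULL);

    sketch_lock(sma);
    /* Always the first test after taking the lock. */
    if (sma->deleted) {
        err = -EIDRM;
        goto out_unlock;
    }

    *out = sma->value;
    err = 0;

out_unlock:
    /* Single exit path: the lock is dropped on success and on failure. */
    sketch_unlock(sma);
    return err;
}
```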

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The initial documentation was a bit incomplete, update accordingly.

    [akpm@linux-foundation.org: make it more readable in 80 columns]
    Signed-off-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

01 Oct, 2013

5 commits

  • This fixes a race in both msgrcv() and msgsnd() between finding the msg
    and actually dealing with the queue, as another thread can delete msqid
    underneath us if we are preempted before acquiring the
    kern_ipc_perm.lock.

    Manfred illustrates this nicely:

    Assume a preemptible kernel that is preempted just after

    msq = msq_obtain_object_check(ns, msqid)

    in do_msgrcv(). The only lock that is held is rcu_read_lock().

    Now the other thread processes IPC_RMID. When the first task is
    resumed, then it will happily wait for messages on a deleted queue.

    Fix this by checking whether the queue has been deleted after taking the
    lock.
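
    A single-threaded replay of the window Manfred describes, as a rough
    sketch. The hook argument stands in for "another thread runs here";
    struct msq_sketch, concurrent_rmid() and sketch_msgrcv() are
    hypothetical names, and the error codes are assumptions of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct msq_sketch {
    bool deleted;
    bool locked;    /* stands in for kern_ipc_perm.lock */
};

/* Code that runs in the unlocked window, e.g. a concurrent IPC_RMID. */
typedef void (*window_hook)(struct msq_sketch *);

static void concurrent_rmid(struct msq_sketch *q)
{
    q->deleted = true;
}

static int sketch_msgrcv(struct msq_sketch *q, window_hook hook)
{
    assert(q != NULL);

    /* Step 1: lookup. The queue is valid at this instant only. */
    if (q->deleted)
        return -EINVAL;

    /* Window: nothing but the RCU read lock is held here. */
    if (hook)
        hook(q);

    /* Step 2: take the object lock. */
    q->locked = true;

    /* Step 3: the fix. Re-check now that removal can no longer race. */
    if (q->deleted) {
        q->locked = false;
        return -EIDRM;
    }

    /* ... receive the message ... */
    q->locked = false;
    return 0;
}
```

    Without step 3 the second call in a replay with concurrent_rmid would
    proceed to wait on a queue that no longer exists.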

    Signed-off-by: Davidlohr Bueso
    Reported-by: Manfred Spraul
    Cc: Rik van Riel
    Cc: Mike Galbraith
    Cc: [3.11]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the
    spinlock section"), the update of semaphore's sem_otime(last semop time)
    was moved to one central position (do_smart_update).

    But since do_smart_update() is only called for operations that modify
    the array, this means that wait-for-zero semops do not update sem_otime
    anymore.

    The fix is simple:
    Non-alter operations must update sem_otime.
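
    As a rough illustration of the rule (not the ipc/sem.c code): in the
    sketch below, smart_update_sketch() stands in for the central update
    path that only runs for altering operations, so the wait-for-zero
    branch has to set sem_otime itself. All names are hypothetical and the
    timestamp is passed in to keep the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

struct sem_sketch {
    int semval;
    time_t sem_otime;   /* time of the last successful semop */
};

/* Central update path; only reached by operations that alter semval. */
static void smart_update_sketch(struct sem_sketch *s, time_t now)
{
    s->sem_otime = now;
}

/* Returns true if the operation completed, false if it would sleep. */
static bool semop_sketch(struct sem_sketch *s, int sem_op, time_t now)
{
    assert(s != NULL);

    if (sem_op == 0) {
        /* Wait-for-zero: a non-alter operation. */
        if (s->semval != 0)
            return false;
        s->sem_otime = now;     /* the fix: update here as well */
        return true;
    }

    if (s->semval + sem_op < 0)
        return false;

    s->semval += sem_op;
    smart_update_sketch(s, now);
    return true;
}
```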

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Reported-by: Jia He
    Tested-by: Jia He
    Cc: Davidlohr Bueso
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The proc interface is not aware of sem_lock(), it instead calls
    ipc_lock_object() directly. This means that simple semop() operations
    can run in parallel with the proc interface. Right now, this is
    uncritical, because the implementation doesn't do anything that requires
    a proper synchronization.

    But it is dangerous and therefore should be fixed.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Operations that need access to the whole array must guarantee that there
    are no simple operations ongoing. Right now this is achieved by
    spin_unlock_wait(sem->lock) on all semaphores.

    If complex_count is nonzero, then this spin_unlock_wait() is not
    necessary, because it was already performed in the past by the thread
    that increased complex_count and even though sem_perm.lock was dropped
    in between, no simple operation could have started, because simple
    operations cannot start when complex_count is non-zero.

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The exclusion of complex operations in sem_lock() is insufficient: after
    acquiring the per-semaphore lock, a simple op must first check that
    sem_perm.lock is not locked and only after that test check
    complex_count. The current code does it the other way around - and that
    creates a race. Details are below.

    The patch is a complete rewrite of sem_lock(), based in part on the code
    from Mike Galbraith. It removes all gotos and all loops and thus the
    risk of livelocks.

    I have tested the patch (together with the next one) on my i3 laptop and
    it didn't cause any problems.

    The bug is probably also present in 3.10 and 3.11, but for these kernels
    it might be simpler just to move the test of sma->complex_count after
    the spin_is_locked() test.

    Details of the bug:

    Assume:
    - sma->complex_count = 0.
    - Thread 1: semtimedop(complex op that must sleep)
    - Thread 2: semtimedop(simple op).

    Pseudo-Trace:

    Thread 1: sem_lock(): acquire sem_perm.lock
    Thread 1: sem_lock(): check for ongoing simple ops
    Nothing ongoing, thread 2 is still before sem_lock().
    Thread 1: try_atomic_semop()
    <<< preempted.

    Thread 2: sem_lock():
    static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
                               int nsops)
    {
        int locknum;
    again:
        if (nsops == 1 && !sma->complex_count) {
            struct sem *sem = sma->sem_base + sops->sem_num;

            /* Lock just the semaphore we are interested in. */
            spin_lock(&sem->lock);

            /*
             * If sma->complex_count was set while we were spinning,
             * we may need to look at things we did not lock here.
             */
            if (unlikely(sma->complex_count)) {
                spin_unlock(&sem->lock);
                goto lock_array;
            }
    <<<<<<<<<
    <<< complex_count is still 0.
    <<<
    <<< Here it is preempted
    <<<<<<<<<

    Thread 1: try_atomic_semop() returns, notices that it must sleep.
    Thread 1: increases sma->complex_count.
    Thread 1: drops sem_perm.lock
    Thread 2:
            /*
             * Another process is holding the global lock on the
             * sem_array; we cannot enter our critical section,
             * but have to wait for the global lock to be released.
             */
            if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
                spin_unlock(&sem->lock);
                spin_unlock_wait(&sma->sem_perm.lock);
                goto again;
            }
    <<< sem_perm.lock already dropped, thus no "goto again;"

            locknum = sops->sem_num;
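
    The difference between the two orderings can be replayed
    deterministically in userspace. The sketch below is not the rewritten
    sem_lock(): it has no real locks or memory barriers, and
    struct sma_sketch, thread1_goes_to_sleep(), fast_path_old() and
    fast_path_new() are hypothetical names. The "between" callback stands
    in for Thread 1 running between Thread 2's two tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sma_sketch {
    bool global_locked;     /* stands in for sem_perm.lock */
    int complex_count;
};

typedef void (*interleave_fn)(struct sma_sketch *);

/* Thread 1 in the trace: must sleep, so it bumps complex_count first
 * and only then drops the global lock. */
static void thread1_goes_to_sleep(struct sma_sketch *sma)
{
    sma->complex_count++;
    sma->global_locked = false;
}

/* Old order: test complex_count, then test the global lock. */
static bool fast_path_old(struct sma_sketch *sma, interleave_fn between)
{
    bool no_complex;

    assert(sma != NULL);
    no_complex = (sma->complex_count == 0);
    if (between)
        between(sma);
    return no_complex && !sma->global_locked;
}

/* New order: test the global lock, then test complex_count. */
static bool fast_path_new(struct sma_sketch *sma, interleave_fn between)
{
    bool lock_free;

    assert(sma != NULL);
    lock_free = !sma->global_locked;
    if (between)
        between(sma);
    return lock_free && sma->complex_count == 0;
}
```

    Starting from global_locked = true and complex_count = 0, the old order
    lets the simple op take the fast path even though a complex op is now
    pending, while the new order does not. If Thread 1 has already finished
    before the first test, the new order sees complex_count != 0 and again
    falls back to the slow path.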

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Cc: Davidlohr Bueso
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

25 Sep, 2013

1 commit

  • Currently, IPC mechanisms do security and auditing related checks under
    RCU. However, since security modules can free the security structure,
    for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
    race if the structure is freed before other tasks are done with it,
    creating a use-after-free condition. Manfred illustrates this nicely,
    for instance with shared mem and selinux:

    -> do_shmat calls rcu_read_lock()
    -> do_shmat calls shm_object_check().
    Checks that the object is still valid - but doesn't acquire any locks.
    Then it returns.
    -> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
    -> selinux_shm_shmat calls ipc_has_perm()
    -> ipc_has_perm accesses ipc_perms->security

    shm_close()
    -> shm_close acquires rw_mutex & shm_lock
    -> shm_close calls shm_destroy
    -> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
    -> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
    -> ipc_free_security calls kfree(ipc_perms->security)

    This patch delays the freeing of the security structures until all RCU
    readers are done. Furthermore it aligns the security life cycle with
    that of the rest of IPC - freeing them based on the reference counter.
    For situations where we need not free security, the current behavior is
    kept. Linus states:

    "... the old behavior was suspect for another reason too: having the
    security blob go away from under a user sounds like it could cause
    various other problems anyway, so I think the old code was at least
    _prone_ to bugs even if it didn't have catastrophic behavior."

    I have tested this patch with IPC testcases from LTP on both my
    quad-core laptop and on a 64 core NUMA server. In both cases selinux is
    enabled, and tests pass for both voluntary and forced preemption models.
    While the mentioned races are theoretical (at least no one has reported
    them), I wanted to make sure that this new logic doesn't break anything
    we weren't aware of.
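
    The life-cycle change can be sketched in userspace as follows. This is
    a simplified model, not the IPC code: there is no real RCU here, the
    free_pending flag stands in for a queued RCU callback, and
    struct ipc_obj_sketch, ipc_put_sketch() and grace_period_elapsed() are
    hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct ipc_obj_sketch {
    int refcount;
    void *security;     /* stands in for the LSM security blob */
    bool free_pending;  /* a deferred free has been queued */
};

/* Dropping the last reference only queues the free. */
static void ipc_put_sketch(struct ipc_obj_sketch *o)
{
    assert(o->refcount > 0);
    if (--o->refcount == 0)
        o->free_pending = true;
}

/* Stand-in for the deferred callback, run once all readers have left. */
static void grace_period_elapsed(struct ipc_obj_sketch *o)
{
    if (!o->free_pending)
        return;
    free(o->security);
    o->security = NULL;
    o->free_pending = false;
}
```

    Between ipc_put_sketch() and grace_period_elapsed() the security
    pointer is still valid, which is the property a reader such as the
    do_shmat path above relies on.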

    Suggested-by: Linus Torvalds
    Signed-off-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Sep, 2013

3 commits

  • No remaining users, we now use ipc_obtain_object_check().

    Signed-off-by: Davidlohr Bueso
    Cc: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This function was replaced by the lockless shm_obtain_object_check(),
    and no longer has any users.

    Signed-off-by: Davidlohr Bueso
    Cc: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • After previous cleanups and optimizations, this function is no longer
    heavily used and we don't have a good reason to keep it. Update the few
    remaining callers and get rid of it.

    Signed-off-by: Davidlohr Bueso
    Cc: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso