05 Sep, 2018

1 commit

  • [ Upstream commit f075faa300acc4f6301e348acde0a4580ed5f77c ]

    In order for load/store tearing prevention to work, _all_ accesses to
    the variable in question need to be done with the READ_ONCE() and
    WRITE_ONCE() macros. Ensure everyone does so for the q->status variable
    in semtimedop().
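
    A minimal sketch of the access pattern (the field names follow ipc/sem.c,
    but the snippet itself is illustrative rather than the exact hunk):

    /* waker: publish the completion status with a single, untorn store */
    WRITE_ONCE(q->status, error);

    /* sleeper: fetch the status with a single, untorn load */
    error = READ_ONCE(q->status);
    if (error != -EINTR)
            goto out;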

    Link: http://lkml.kernel.org/r/20180717052654.676-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Davidlohr Bueso
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
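
    For illustration (this line is an example of the convention, not a hunk
    from the patch), the identifier is a single comment at the very top of a
    source file:

    // SPDX-License-Identifier: GPL-2.0

    or, for files that must stick to classic C comments:

    /* SPDX-License-Identifier: GPL-2.0 */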

    This patch is based on work done by Thomas Gleixner, Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results of the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    The criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit

  • Pull ipc compat cleanup and 64-bit time_t from Al Viro:
    "IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa
    Dinamani"

    * 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    utimes: Make utimes y2038 safe
    ipc: shm: Make shmid_kernel timestamps y2038 safe
    ipc: sem: Make sem_array timestamps y2038 safe
    ipc: msg: Make msg_queue timestamps y2038 safe
    ipc: mqueue: Replace timespec with timespec64
    ipc: Make sys_semtimedop() y2038 safe
    get rid of SYSVIPC_COMPAT on ia64
    semtimedop(): move compat to native
    shmat(2): move compat to native
    msgrcv(2), msgsnd(2): move compat to native
    ipc(2): move compat to native
    ipc: make use of compat ipc_perm helpers
    semctl(): move compat to native
    semctl(): separate all layout-dependent copyin/copyout
    msgctl(): move compat to native
    msgctl(): split the actual work from copyin/copyout
    ipc: move compat shmctl to native
    shmctl: split the work from copyin/copyout

    Linus Torvalds
     

09 Sep, 2017

4 commits

  • ipc_findkey() used to scan all objects to look for the wanted key. This
    is slow when using a high number of keys. This change adds an rhashtable
    of kern_ipc_perm objects in ipc_ids, so that key lookups cease to be O(n).

    This change gives an 865% improvement on the reaim.jobs_per_min benchmark
    on a 56-thread Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G of
    memory [1].

    Other (more micro) benchmark results, by the author: On an i5 laptop, the
    following loop executed right after a reboot took, without and with this
    change:

    for (int i = 0, k = 0x424242; i < KEYS; ++i)
            semget(k++, 1, IPC_CREAT | 0600);

    KEYS        total without     total with     max single call    max single call
                                                         (without)             (with)

    1                     3.5         4.9 µs                  3.5                4.9
    10                    7.6         8.6 µs                  3.7                4.7
    32                   16.2        15.9 µs                  4.3                5.3
    100                  72.9        41.8 µs                  3.7                4.7
    1000              5,630.0       502.0 µs                    *                  *
    10000         1,340,000.0     7,240.0 µs                    *                  *
    31900        17,600,000.0    22,200.0 µs                    *                  *

    *: unreliable measure: high variance

    The duration for a lookup-only usage was obtained by the same loop once
    the keys are present:

    KEYS        total without     total with     max single call    max single call
                                                         (without)             (with)

    1                     2.1         2.5 µs                  2.1                2.5
    10                    4.5         4.8 µs                  2.2                2.3
    32                   13.0        10.8 µs                  2.3                2.8
    100                  82.9        25.1 µs                    *                2.3
    1000              5,780.0       217.0 µs                    *                  *
    10000         1,470,000.0     2,520.0 µs                    *                  *
    31900        17,400,000.0     7,810.0 µs                    *                  *

    Finally, executing each semget() in a new process gave, when still
    summing only the durations of these syscalls:

    creation:
    KEYS        total without     total with

    1                     3.7         5.0 µs
    10                   32.9        36.7 µs
    32                  125.0       109.0 µs
    100                 523.0       353.0 µs
    1000             20,300.0     3,280.0 µs
    10000         2,470,000.0    46,700.0 µs
    31900        27,800,000.0   219,000.0 µs

    lookup-only:
    KEYS        total without     total with

    1                     2.5         2.7 µs
    10                   25.4        24.4 µs
    32                  106.0        72.6 µs
    100                 591.0       352.0 µs
    1000             22,400.0     2,250.0 µs
    10000         2,510,000.0    25,700.0 µs
    31900        28,200,000.0   115,000.0 µs

    [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

    Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
    Signed-off-by: Guillaume Knispel
    Reviewed-by: Marc Pardo
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: Manfred Spraul
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: "Peter Zijlstra (Intel)"
    Cc: Ingo Molnar
    Cc: Sebastian Andrzej Siewior
    Cc: Serge Hallyn
    Cc: Andrey Vagin
    Cc: Guillaume Knispel
    Cc: Marc Pardo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guillaume Knispel
     
  • Replacing semop()'s kmalloc with kvmalloc was originally proposed by
    Manfred on the premise that it can be called for large (larger than
    order-1) sizes. For example, while Oracle recommends setting SEMOPM to a
    _minimum_ of 100, some distros[1] encourage the setting to be a factor of
    the number of db tasks (PROCESSES), which can get fishy for large systems
    (easily going beyond 1000).

    [1] An Example of Semaphore Settings
    https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-An_Example_of_Semaphore_Settings.html

    So let's just convert this to kvmalloc, just like the rest of the
    allocations we do in ipc. While the fallback vmalloc obviously involves
    more overhead, this is by far the uncommon path, and it's better for the
    user than just erroring out with kmalloc.
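
    A minimal sketch of the conversion (illustrative, not the exact ipc/sem.c
    hunk):

    /* before: physically contiguous allocation, fails for large nsops */
    sops = kmalloc(sizeof(*sops) * nsops, GFP_KERNEL);

    /* after: falls back to vmalloc when the request is large */
    sops = kvmalloc(sizeof(*sops) * nsops, GFP_KERNEL);
    ...
    kvfree(sops);   /* handles both kmalloc- and vmalloc-backed memory */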

    Link: http://lkml.kernel.org/r/20170803184136.13855-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... 'tis not used.

    Link: http://lkml.kernel.org/r/20170803184136.13855-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The refcount_t type and its corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This makes it
    possible to avoid accidental reference-counter overflows that might lead
    to use-after-free situations.
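
    A minimal sketch of the conversion pattern (field and helper names are
    illustrative):

    refcount_t refcount;                     /* was: atomic_t refcount; */

    refcount_set(&perm->refcount, 1);        /* was: atomic_set() */
    refcount_inc(&perm->refcount);           /* saturates instead of wrapping */
    if (refcount_dec_and_test(&perm->refcount))
            free_object(perm);               /* was: atomic_dec_and_test() */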

    Link: http://lkml.kernel.org/r/1499417992-3238-3-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Cc: Peter Zijlstra
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Alexey Dobriyan
    Cc: Serge Hallyn
    Cc:
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Elena Reshetova
     

04 Sep, 2017

2 commits

  • time_t is not y2038 safe. Replace all uses of
    time_t by y2038 safe time64_t.

    Similarly, replace the calls to get_seconds() with
    y2038 safe ktime_get_real_seconds().
    Note that this preserves fast access on 64 bit systems,
    but 32 bit systems need sequence counters.

    The syscall interfaces themselves are not changed as part of this patch;
    they will be part of a different series.
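
    Roughly, the substitution looks like this (illustrative, not the exact
    hunks):

    /* before: 32-bit time_t on 32-bit systems, overflows in 2038 */
    sma->sem_otime = get_seconds();

    /* after: 64-bit seconds everywhere */
    sma->sem_otime = ktime_get_real_seconds();   /* field is now time64_t */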

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Deepa Dinamani
     
  • struct timespec is not y2038 safe on 32 bit machines.
    Replace timespec with y2038 safe struct timespec64.

    Note that the patch only changes the internals without
    modifying the syscall interface. This will be part
    of a separate series.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Deepa Dinamani
     

21 Aug, 2017

1 commit


17 Aug, 2017

1 commit

  • There is no agreed-upon definition of spin_unlock_wait()'s semantics,
    and it appears that all callers could do just as well with a lock/unlock
    pair. This commit therefore replaces the spin_unlock_wait() call in
    exit_sem() with spin_lock() followed immediately by spin_unlock().
    This should be safe from a performance perspective because exit_sem()
    is rarely invoked in production.
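
    A sketch of the substitution (the lock shown is illustrative):

    /* before: wait until any current holder of the lock releases it */
    spin_unlock_wait(&ulp->lock);

    /* after: a full acquire/release pair gives the same guarantee with
     * well-defined ordering semantics */
    spin_lock(&ulp->lock);
    spin_unlock(&ulp->lock);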

    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Alan Stern
    Cc: Andrea Parri
    Cc: Linus Torvalds
    Acked-by: Manfred Spraul

    Paul E. McKenney
     

03 Aug, 2017

1 commit

  • When building with the randstruct gcc plugin, the layout of the IPC
    structs will be randomized, which requires any sub-structure accesses to
    use container_of(). The proc display handlers were missing the needed
    container_of()s since the iterator is passing in the top-level struct
    kern_ipc_perm.
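
    A sketch of the kind of fix involved (the struct and field names are the
    real IPC ones, but the hunk is illustrative):

    static int sysvipc_shm_proc_show(struct seq_file *s, void *it)
    {
            struct kern_ipc_perm *ipcp = it;
            /* before: struct shmid_kernel *shp = it;  (assumes shm_perm is first) */
            struct shmid_kernel *shp = container_of(ipcp, struct shmid_kernel,
                                                     shm_perm);
            ...
    }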

    This would lead to crashes when running the "lsipc" program after the
    system had IPC registered (e.g. after starting up Gnome):

    general protection fault: 0000 [#1] PREEMPT SMP
    ...
    RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0
    ...
    Call Trace:
    sysvipc_shm_proc_show+0x5e/0x150
    sysvipc_proc_show+0x1a/0x30
    seq_read+0x2e9/0x3f0
    ...

    Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast
    Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization")
    Signed-off-by: Kees Cook
    Reported-by: Dominik Brodowski
    Acked-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

16 Jul, 2017

3 commits


13 Jul, 2017

8 commits

  • The remaining users of __sem_free() can simply call kvfree() instead for
    better readability.

    [manfred@colorfullife.com: Rediff to keep rcu protection for security_sem_alloc()]
    Link: http://lkml.kernel.org/r/20170525185107.12869-20-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Only after ipc_addid() has succeeded will refcounting be used, so move
    initialization into ipc_addid() and remove from open-coded *_alloc()
    routines.

    Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Loosely based on a patch from Kees Cook:
    - id and retval can be merged
    - if ipc_addid() fails, then use call_rcu() directly.

    The difference is that call_rcu() is used for failed ipc_addid() calls, to
    continue to guarantee an rcu delay for security_sem_free().
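
    A rough sketch of why call_rcu() keeps the required grace period before
    the security blob is freed (simplified from the patch):

    static void sem_rcu_free(struct rcu_head *head)
    {
            struct sem_array *sma = container_of(head, struct sem_array,
                                                 sem_perm.rcu);

            security_sem_free(sma);
            kvfree(sma);
    }

    /* on ipc_addid() failure: free only after an RCU grace period */
    call_rcu(&sma->sem_perm.rcu, sem_rcu_free);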

    Link: http://lkml.kernel.org/r/20170525185107.12869-14-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Instead of using ipc_rcu_alloc() which only performs the refcount bump,
    open code it to perform better sem-specific checks. This also allows
    for sem_array structure layout to be randomized in the future.

    [manfred@colorfullife.com: Rediff, because the memset was temporarily inside ipc_rcu_alloc()]
    Link: http://lkml.kernel.org/r/20170525185107.12869-10-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Avoid using ipc_rcu_free, since it just re-finds the original structure
    pointer. For the pre-list-init failure path, there is no RCU needed,
    since it was just allocated. It can be directly freed.

    Link: http://lkml.kernel.org/r/20170525185107.12869-6-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The only users of ipc_alloc() were ipc_rcu_alloc() and the on-heap
    sem_io fall-back memory. Better to just open-code these to make things
    easier to read.

    [manfred@colorfullife.com: Rediff due to inclusion of memset() into ipc_rcu_alloc()]
    Link: http://lkml.kernel.org/r/20170525185107.12869-5-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • ipc has two management structures that exist for every id:
    - struct kern_ipc_perm, which contains e.g. the permissions.
    - struct ipc_rcu, which contains the rcu head for rcu handling and the
    refcount.

    The patch merges both structures.

    As a bonus, we may save one cacheline, because both structures are
    cacheline aligned. In addition, it reduces the number of casts; instead,
    most codepaths can use container_of().

    To simplify code, the ipc_rcu_alloc initializes the allocation to 0.
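
    A rough sketch of the resulting layout (simplified; the real structure has
    more fields):

    struct kern_ipc_perm {
            spinlock_t      lock;
            key_t           key;
            ...
            atomic_t        refcount;       /* was in struct ipc_rcu */
            struct rcu_head rcu;            /* was in struct ipc_rcu */
    } ____cacheline_aligned_in_smp;

    /* instead of casting, users recover their own type via container_of: */
    struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);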

    [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()]
    Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com
    Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sma->sem_base is initialized with

    sma->sem_base = (struct sem *) &sma[1];

    The current code has four problems:
    - There is an unnecessary pointer dereference - sem_base is not needed.
    - Alignment for struct sem only works by chance.
    - The current code causes false positive for static code analysis.
    - This is a cast between different non-void types, which the future
    randstruct GCC plugin warns on.

    And, as a bonus, the code size gets smaller:

    Before:
    0 .text 00003770
    After:
    0 .text 0000374e
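
    A rough before/after sketch of the layout change (simplified):

    /* before: a pointer that always points just past the array header */
    struct sem_array {
            ...
            struct sem      *sem_base;
    };
    sma->sem_base = (struct sem *) &sma[1];

    /* after: a flexible array member; no extra pointer, no cast */
    struct sem_array {
            ...
            struct sem      sems[];
    };
    /* accesses become sma->sems[i] instead of sma->sem_base[i] */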

    [manfred@colorfullife.com: s/[0]/[]/, per hch]
    Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com
    Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Acked-by: Kees Cook
    Cc: Kees Cook
    Cc:
    Cc: Davidlohr Bueso
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Fabian Frederick
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

02 Mar, 2017

1 commit


28 Feb, 2017

2 commits

  • sysv sem has two lock modes: one with per-semaphore locks, and one with a
    single global lock for the whole array. When switching from the
    per-semaphore locks to the global lock, all per-semaphore locks must be
    scanned for ongoing operations.

    The patch adds a hysteresis for switching from the global lock to the
    per semaphore locks. This reduces how often the per-semaphore locks
    must be scanned.

    Compared to the initial patch, this is a simplified solution: Setting
    USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior.

    In theory, a workload with exactly 10 simple sops and then one complex
    op now scales a bit worse, but this is pure theory: if there is
    concurrency, it won't be exactly 10:1:10:1:10:1:...; if there is no
    concurrency, then there is no need for scalability.
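
    A rough sketch of the hysteresis idea (the names follow the patch, but the
    snippet is illustrative, not the exact sem_lock() code):

    #define USE_GLOBAL_LOCK_HYSTERESIS      10

    /* complex operation: switch to the global lock and arm the counter */
    sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;

    /* simple operation: keep using the global lock until the counter has
     * decayed to zero, only then return to per-semaphore locking */
    if (sma->use_global_lock) {
            sma->use_global_lock--;
            /* ... take the global lock ... */
    } else {
            /* ... fast path: take only the per-semaphore lock ... */
    }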

    Link: http://lkml.kernel.org/r/1476851896-3590-3-git-send-email-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc:
    Cc: kernel test robot
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • a) The ACQUIRE in spin_lock() applies to the read, not to the store, at
    least for powerpc. This forces us to add an smp_mb() into the fast path.

    b) The memory barrier provided by spin_unlock_wait() is right now arch
    dependent.

    Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

    Advantage: faster single-op semop() calls, observed +8.9% on x86 (the
    other solution would be arch dependencies in ipc/sem).

    Disadvantage: slower complex op semop calls, if (and only if) there are
    no sleeping operations.

    The next patch adds hysteresis, this further reduces the probability
    that the slow path is used.

    Link: http://lkml.kernel.org/r/1476851896-3590-2-git-send-email-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc:
    Cc: kernel test robot
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

11 Jan, 2017

1 commit

  • Based on the syzcaller test case from dvyukov:

    https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt

    The slow (i.e. failure to acquire) syscall exit from semtimedop()
    incorrectly assumed that the same lock is acquired as was taken at the
    initial syscall entry.

    This is wrong:
    - thread A: single semop semop(), sleeps
    - thread B: multi semop semop(), sleeps
    - thread A: woken up by signal/timeout

    With this sequence, the initial sem_lock() call locks the per-semaphore
    spinlock, and it is unlocked with sem_unlock(). The call at the syscall
    return locks the global spinlock. Because locknum is not updated, the
    following sem_unlock() call unlocks the per-semaphore spinlock, which is
    actually not locked.

    The fix is trivial: Use the return value from sem_lock.
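
    The essence of the fix, roughly (illustrative):

    /* before: the return value (which lock was taken) is discarded */
    sem_lock(sma, sops, nsops);

    /* after: remember which lock is held so sem_unlock() drops the same one */
    locknum = sem_lock(sma, sops, nsops);
    ...
    sem_unlock(sma, locknum);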

    Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop")
    Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reported-by: Dmitry Vyukov
    Reported-by: Johanna Abrahamsson
    Tested-by: Johanna Abrahamsson
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

15 Dec, 2016

7 commits

  • We can avoid the idr tree lookup (albeit possibly avoiding
    idr_find_fast()) when being awoken in EINTR, as the semid will not
    change in this context while blocked. Use the sma pointer directly and
    take the sem_lock, then re-check for RMID races. We continue to
    re-check the queue.status with the lock held such that we can detect
    situations where we are dealing with a spurious wakeup but another
    task that holds the sem_lock updated the queue.status while we were
    spinning for it. Once we take the lock it obviously won't change again.

    Being the only caller, get rid of sem_obtain_lock() altogether.

    Link: http://lkml.kernel.org/r/1478708774-28826-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Instead of using the reverse goto, we can simplify the flow and make it
    more natural by just doing a do-while instead. One would hope this is
    the standard way (or obviously just with a while loop) that we do
    wait/wakeup handling in the kernel. The exact same logic is kept, just
    more indented.
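
    A generic illustration of the control-flow change (not the actual sem.c
    hunk):

    /* before: a backwards goto re-enters the wait */
    again:
            schedule();
            if (READ_ONCE(q->status) == -EINTR)
                    goto again;

    /* after: the same logic expressed as a do-while loop */
    do {
            schedule();
    } while (READ_ONCE(q->status) == -EINTR);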

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1478708774-28826-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... saves some LoC and looks cleaner than re-implementing the calls.

    Link: http://lkml.kernel.org/r/1474225896-10066-6-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The compiler already does this, but make it explicit. This helper is
    really small and also used in update_queue's main loop, which is O(N^2)
    scanning. Inline and avoid the function overhead.

    Link: http://lkml.kernel.org/r/1474225896-10066-5-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This is the main workhorse that deals with semop user calls such that
    the waitforzero or semval update operations, on the set, can complete or
    not as the sma currently stands. Currently, the set is iterated twice
    (setting semval, then backwards for the sempid value). Slowpaths, and
    particularly SEM_UNDO calls, must undo any altered sem when it is
    detected that the caller must block or has errored out.

    With larger sets, there can be situations where this involves a lot of
    cycles and can obviously be a suboptimal use of cached resources in
    shared memory. I.e., discarding CPU caches that are also calling semop
    and have the sembuf cached (and can complete), while the current lock
    holder doing the semop will block, error, or do a waitforzero
    operation.

    This patch proposes still iterating the set twice, but the first scan is
    read-only, and we perform the actual updates afterward, once we know
    that the call will succeed. In order to not suffer from the overhead of
    dealing with sops that act on the same sem_num, such (rare) cases use
    perform_atomic_semop_slow(), which is exactly what we have now.
    Duplicates are detected before grabbing sem_lock, using a simple
    32/64-bit hash array variable based on the sem_num we are working on.

    In addition, add some comments about where we expect the caller to block.
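
    A rough sketch of the duplicate detection (simplified from the patch):

    bool dupsop = false;
    unsigned long dup = 0;

    for (sop = sops; sop < sops + nsops; sop++) {
            unsigned long mask = 1ULL << ((sop->sem_num) % BITS_PER_LONG);

            if (dup & mask) {
                    /* same semaphore touched twice: fall back to the
                     * undo-capable perform_atomic_semop_slow() path */
                    dupsop = true;
                    break;
            }
            dup |= mask;
    }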

    [akpm@linux-foundation.org: coding-style fixes]
    [colin.king@canonical.com: ensure we left shift a ULL rather than a 32 bit integer]
    Link: http://lkml.kernel.org/r/20161028181129.7311-1-colin.king@canonical.com
    Link: http://lkml.kernel.org/r/20160921194603.GB21438@linux-80c1.suse
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Colin Ian King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Our sysv sems have been using the notion of lockless wakeups for a
    while, ever since commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process
    out of the spinlock section"), in order to reduce the sem_lock hold
    times. This in-house pending queue can be replaced by wake_q (just like
    all the rest of ipc now), in that it provides the following advantages:

    o Simplifies and gets rid of unnecessary code.

    o We get rid of the IN_WAKEUP complexities. Given that wake_q_add()
    grabs a reference to the task, if it is awoken due to an unrelated event
    between the wake_q_add() and wake_up_q() window, we cannot race with
    sys_exit and the imminent call to wake_up_process().

    o By not spinning IN_WAKEUP, we no longer need to disable preemption.

    In consequence, the wakeup paths (after schedule(), that is) must
    acknowledge an external signal/event, as well as spurious wakeups
    occurring during the pending wakeup window. Obviously there are no
    changes in semantics that could be visible to the user. The fastpath is
    _only_ for when we know for sure that we were awoken due to the waker's
    successful semop call (queue.status is not -EINTR).
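
    The resulting wakeup pattern, roughly (illustrative):

    DEFINE_WAKE_Q(wake_q);

    /* under sem_lock: grab the task reference first, then publish the result */
    wake_q_add(&wake_q, q->sleeper);
    WRITE_ONCE(q->status, error);

    sem_unlock(sma, locknum);
    rcu_read_unlock();

    /* outside the lock: the actual wake_up_process() calls happen here */
    wake_up_q(&wake_q);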

    On a 48-core Haswell, running the ipcscale 'waitforzero' test, the
    following is seen with increasing thread counts:

                                      v4.8-rc5              v4.8-rc5
                                                             semopv2
    Hmean sembench-sem-2 574733.00 ( 0.00%) 578322.00 ( 0.62%)
    Hmean sembench-sem-8 811708.00 ( 0.00%) 824689.00 ( 1.59%)
    Hmean sembench-sem-12 842448.00 ( 0.00%) 845409.00 ( 0.35%)
    Hmean sembench-sem-21 933003.00 ( 0.00%) 977748.00 ( 4.80%)
    Hmean sembench-sem-48 935910.00 ( 0.00%) 1004759.00 ( 7.36%)
    Hmean sembench-sem-79 937186.00 ( 0.00%) 983976.00 ( 4.99%)
    Hmean sembench-sem-234 974256.00 ( 0.00%) 1060294.00 ( 8.83%)
    Hmean sembench-sem-265 975468.00 ( 0.00%) 1016243.00 ( 4.18%)
    Hmean sembench-sem-296 991280.00 ( 0.00%) 1042659.00 ( 5.18%)
    Hmean sembench-sem-327 975415.00 ( 0.00%) 1029977.00 ( 5.59%)
    Hmean sembench-sem-358 1014286.00 ( 0.00%) 1049624.00 ( 3.48%)
    Hmean sembench-sem-389 972939.00 ( 0.00%) 1043127.00 ( 7.21%)
    Hmean sembench-sem-420 981909.00 ( 0.00%) 1056747.00 ( 7.62%)
    Hmean sembench-sem-451 990139.00 ( 0.00%) 1051609.00 ( 6.21%)
    Hmean sembench-sem-482 965735.00 ( 0.00%) 1040313.00 ( 7.72%)

    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: merge fix for WAKE_Q to DEFINE_WAKE_Q rename]
    Link: http://lkml.kernel.org/r/20161122210410.5eca9fc2@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1474225896-10066-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • …iously be paired with its _prepare()

    counterpart. At least whenever possible, as there is no harm in calling
    it bogusly as we do now in a few places. Immediate error semop(2) paths
    that are far from ever having the task block can be simplified and avoid
    a few unnecessary loads on their way out of the call as it is not deeply
    nested.

    Link: http://lkml.kernel.org/r/1474225896-10066-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Davidlohr Bueso
     

12 Oct, 2016

2 commits

  • In a CONFIG_PREEMPT=n kernel a softlockup was observed in the for loop in
    exit_sem(). Apparently it's possible for the loop to take quite a long
    time and it doesn't have a scheduling point in it. Since the code is
    executing under an rcu read section this may also cause rcu stalls, which
    in turn block synchronize_rcu operations and more or less destabilise
    the whole system.

    Fix this by introducing a cond_resched() at the beginning of the loop.
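
    The fix, in essence (illustrative):

    for (;;) {
            struct sem_array *sma;
            ...
            cond_resched();         /* explicit scheduling point each iteration */

            rcu_read_lock();
            ...
    }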

    So this patch fixes the following:

    NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
    CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
    Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
    task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
    RIP: 0010:[] [] _raw_spin_lock+0x17/0x30
    RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
    RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
    RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
    R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
    R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
    FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
    Call Trace:
    ? exit_sem+0x7c/0x280
    do_exit+0x338/0xb40
    do_group_exit+0x43/0xd0
    SyS_exit_group+0x14/0x20
    entry_SYSCALL_64_fastpath+0x16/0x6e

    Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
    Signed-off-by: Nikolay Borisov
    Cc: Herton R. Krzesinski
    Cc: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     
  • Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
    race:

    sem_lock has a fast path that allows parallel simple operations.
    There are two reasons why a simple operation cannot run in parallel:
    - a non-simple operation is ongoing (sma->sem_perm.lock held)
    - a complex operation is sleeping (sma->complex_count != 0)

    As both facts are stored independently, a thread can bypass the current
    checks by sleeping in the right positions. See below for more details
    (or kernel bugzilla 105651).

    The patch fixes that by creating one variable (complex_mode)
    that tracks both reasons why parallel operations are not possible.

    The patch also updates stale documentation regarding the locking.

    With regards to stable kernels:
    The patch is required for all kernels that include the
    commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

    The alternative is to revert the patch that introduced the race.

    The patch is safe for backporting, i.e. it makes no assumptions
    about memory barriers in spin_unlock_wait().

    Background:
    Here is the race of the current implementation:

    Thread A: (simple op)
    - does the first "sma->complex_count == 0" test

    Thread B: (complex op)
    - does sem_lock(): This includes an array scan. But the scan can't
    find Thread A, because Thread A does not own sem->lock yet.
    - the thread does the operation, increases complex_count,
    drops sem_lock, sleeps

    Thread A:
    - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
    - sleeps before the complex_count test

    Thread C: (complex op)
    - does sem_lock (no array scan, complex_count==1)
    - wakes up Thread B.
    - decrements complex_count

    Thread A:
    - does the complex_count test

    Bug:
    Now both thread A and thread C operate on the same array, without
    any synchronization.

    Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
    Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
    Reported-by:
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

03 Aug, 2016

1 commit

  • Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
    to receive an rcu freeing function, but used the generic ipc_rcu_free()
    instead of msg_rcu_free(), which does the security cleanup.

    Running LTP msgsnd06 with kmemleak gives the following:

    cat /sys/kernel/debug/kmemleak

    unreferenced object 0xffff88003c0a11f8 (size 8):
    comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
    hex dump (first 8 bytes):
    1b 00 00 00 01 00 00 00 ........
    backtrace:
    kmemleak_alloc+0x23/0x40
    kmem_cache_alloc_trace+0xe1/0x180
    selinux_msg_queue_alloc_security+0x3f/0xd0
    security_msg_queue_alloc+0x2e/0x40
    newque+0x4e/0x150
    ipcget+0x159/0x1b0
    SyS_msgget+0x39/0x40
    entry_SYSCALL_64_fastpath+0x13/0x8f

    Manfred Spraul suggested fixing sem.c as well, and Davidlohr Bueso
    suggested only using ipc_rcu_free() in case of security allocation
    failure in newary().

    Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
    Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

14 Jun, 2016

2 commits

  • With the modified semantics of spin_unlock_wait() a number of
    explicit barriers can be removed. Also update the comment for the
    do_exit() usecase, as that was somewhat stale/obscure.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce smp_acquire__after_ctrl_dep(); this construct is not
    uncommon, but the lack of this barrier is.

    Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
    semaphore code and the qspinlock code.
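
    A minimal, generic sketch of the construct (not a specific hunk):

    /* spin until the flag is observed set ... */
    while (!READ_ONCE(obj->ready))
            cpu_relax();

    /* ... then upgrade the control dependency to ACQUIRE ordering, so later
     * loads cannot be reordered before the load that saw the flag */
    smp_acquire__after_ctrl_dep();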

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Mar, 2016

1 commit

  • As indicated by bug#112271, Linux sets the sempid value upon semctl, and
    not only for semop calls. However, within semctl we only do this for
    SETVAL, leaving SETALL without updating the field, which makes for rather
    inconsistent behavior when compared to other Unices.

    There is really no documentation regarding this and therefore users
    should not make assumptions. With this patch, along with updating the
    semctl(2) manpage, this scenario should become less ambiguous. As such,
    set sempid on the SETALL cmd.

    Also update some in-code documentation, specifying where the sempid is
    set.

    Passes ltp and custom testcase where a child (fork) does SETALL to the
    set.
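
    A small userspace illustration of the now-consistent behaviour (the union
    definition is the caller's responsibility on Linux, per semctl(2)):

    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    unsigned short vals[1] = { 1 };
    union semun arg = { .array = vals };

    /* in a child process: */
    semctl(semid, 0, SETALL, arg);

    /* in the parent: GETPID now reports the child's pid, just as it already
     * did after SETVAL or a semop() from the child */
    pid_t last_pid = semctl(semid, 0, GETPID);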

    Signed-off-by: Davidlohr Bueso
    Reported-by: Philip Semanchuk
    Cc: Michael Kerrisk
    Cc: PrasannaKumar Muralidharan
    Cc: Manfred Spraul
    Cc: Herton R. Krzesinski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso