03 Nov, 2011

3 commits

  • include/linux/sem.h contains several structures that are only used within
    ipc/sem.c.

    The patch moves them into ipc/sem.c - there is no need to expose the
    structures to the whole kernel.

    No functional changes; only whitespace cleanups and fixes for lines
    longer than 80 characters.

    Signed-off-by: Manfred Spraul
    Acked-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • semtimedop() does not handle spurious wakeups; it returns -EINTR to user
    space, whereas most other schedule() users just loop and do not return
    to user space. The patch adds such a loop to semtimedop() (a userspace
    sketch of the general pattern follows this entry).

    Signed-off-by: Manfred Spraul
    Reported-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
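
    A minimal userspace sketch of that pattern, using POSIX condition
    variables rather than the kernel's schedule()-based code (so it is an
    analogue, not the kernel fix): the waiter re-checks its condition in a
    loop, and a spurious wakeup simply leads to another sleep instead of an
    error return.

        /*
         * Userspace analogue (not the kernel code): a waiter must re-check
         * its predicate in a loop, because pthread_cond_timedwait(), like
         * schedule(), may return even though the condition is not yet true.
         */
        #include <errno.h>
        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <time.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        static bool done;               /* the real wakeup condition */

        /* Returns 0 on success, -ETIMEDOUT if the deadline passed first. */
        static int wait_for_done(const struct timespec *deadline)
        {
            int ret = 0;

            pthread_mutex_lock(&lock);
            while (!done) {             /* the loop swallows spurious wakeups */
                ret = pthread_cond_timedwait(&cond, &lock, deadline);
                if (ret == ETIMEDOUT) {
                    ret = -ETIMEDOUT;
                    break;
                }
                ret = 0;
            }
            pthread_mutex_unlock(&lock);
            return ret;
        }

        static void *waker(void *arg)
        {
            (void)arg;
            pthread_mutex_lock(&lock);
            done = true;
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
            return NULL;
        }

        int main(void)
        {
            struct timespec deadline;
            pthread_t t;

            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += 2;

            pthread_create(&t, NULL, waker, NULL);
            printf("wait_for_done() = %d\n", wait_for_done(&deadline));
            pthread_join(t, NULL);
            return 0;
        }

    Built with "cc -pthread", this prints 0; without the while loop, a
    spurious return from the wait would be reported to the caller, which is
    exactly the -EINTR behaviour described above.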
     
  • sys_semtimedop() may return -EIDRM although the semaphore operation
    completed successfully:

    thread 2: semtimedop(), sleeps
    thread 1: semop():
              * acquires sem_lock()
    thread 2: woken up due to timeout, sem_lock() loops
    thread 1: * notices that thread 2 could be completed
              * performs the operations that thread 2 is sleeping on
              * marks the semaphore operation as IN_WAKEUP
              * drops sem_lock(), does wakeup, sets the return code to 0
    thread 2: delayed due to an interrupt, whatever
    thread 1: * returns to user space
    thread 2: still delayed
    thread 1: semctl(IPC_RMID)
              * acquires sem_lock()
              * ipc_rmid(), ipcp->deleted=1
              * drops sem_lock()
    thread 2: finally continues - but sem_lock() now fails due to
              ipcp->deleted == 1
              * returns -EIDRM instead of 0

    The fix is trivial: always use the return code recorded in queue.status
    (a simplified model of this idea follows this entry).

    In the real world, the race probably doesn't matter: if the semaphore
    array is destroyed, the app is probably not interested in whether the
    last operation succeeded or was already cancelled.

    Signed-off-by: Manfred Spraul
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
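
    The sketch below is a simplified model, not the kernel code; the
    structure, sentinel values and helper are made up for illustration. The
    waker records the final result in the sleeper's own request record
    before waking it, so the sleeper can return that result without touching
    shared state that may already have been removed.

        /*
         * Simplified model (not the kernel code): the waker publishes the
         * result in the waiter's own record, so the waiter never has to
         * look at the (possibly already RMID'd) semaphore array again.
         */
        #include <pthread.h>
        #include <sched.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define NOT_DONE  (-101)        /* illustrative: still sleeping */
        #define IN_WAKEUP (-100)        /* illustrative: wakeup in progress */

        struct request {
            atomic_int status;          /* sentinel or final result */
        };

        static struct request req = { .status = NOT_DONE };

        static void *waker(void *arg)
        {
            (void)arg;
            /* Perform the operation on behalf of the sleeper ... */
            atomic_store_explicit(&req.status, IN_WAKEUP, memory_order_relaxed);
            /* ... then publish the final result (0 == success).  The
             * semaphore array could be destroyed right after this. */
            atomic_store_explicit(&req.status, 0, memory_order_release);
            return NULL;
        }

        int main(void)
        {
            pthread_t t;
            int status;

            pthread_create(&t, NULL, waker, NULL);

            /* Sleeper: wait until a final result has been published. */
            do {
                status = atomic_load_explicit(&req.status,
                                              memory_order_acquire);
                if (status == NOT_DONE || status == IN_WAKEUP)
                    sched_yield();
            } while (status == NOT_DONE || status == IN_WAKEUP);

            /* Return the recorded result instead of re-checking the
             * shared object - re-checking is what produced -EIDRM. */
            printf("operation result = %d\n", status);

            pthread_join(t, NULL);
            return 0;
        }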
     

26 Jul, 2011

1 commit

  • If a semaphore array is removed and in parallel a sleeping task is woken
    up (signal or timeout, does not matter), then the woken up task does not
    wait until wake_up_sem_queue_do() is completed. This will cause crashes,
    because wake_up_sem_queue_do() will read from a stale pointer.

    The fix is simple: Regardless of anything, always call get_queue_result().
    This function waits until wake_up_sem_queue_do() has finished its task.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=27142

    Reported-by: Yuriy Yevtukhov
    Reported-by: Harald Laabs
    Signed-off-by: Manfred Spraul
    Acked-by: Eric Dumazet
    Cc: [2.6.35+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

21 Jul, 2011

1 commit


31 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
    because the resource comes from current's own ipc namespace.

    setuid/setgid apply to uids in the task's own namespace, so again the
    checks can be against current_user_ns().

    Changelog:
    Jan 11: Use task_ns_capable() in place of sched_capable().
    Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    Jan 11: Clarify (hopefully) some logic in futex and sched.c
    Feb 15: use ns_capable for ipc, not nsown_capable
    Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
    Feb 23: pass ns down rather than taking it from current

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

02 Oct, 2010

1 commit

  • The semctl syscall has several code paths that lead to the leakage of
    uninitialized kernel stack memory (namely the IPC_INFO, SEM_INFO,
    IPC_STAT, and SEM_STAT commands) during the use of the older, obsolete
    version of the semid_ds struct.

    The copy_semid_to_user() function declares a semid_ds struct on the stack
    and copies it back to the user without initializing or zeroing the
    "sem_base", "sem_pending", "sem_pending_last", and "undo" pointers,
    allowing the leakage of 16 bytes of kernel stack memory.

    The code is still reachable on 32-bit systems - when calling semctl(),
    newer glibc versions automatically OR the IPC command with the IPC_64
    flag, but invoking the syscall directly allows users to use the older
    version of the struct (a minimal illustration of the zero-before-copy
    fix follows this entry).

    Signed-off-by: Dan Rosenberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Rosenberg
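
    The fix pattern is the usual one for such leaks: zero the on-stack
    structure before filling in the fields that are copied out. The
    following userspace illustration uses a made-up stand-in for the old
    semid_ds layout, not the real kernel definitions:

        /*
         * Illustration only: an on-stack struct copied out without zeroing
         * leaks whatever stack bytes sit in the fields (and padding) that
         * were never assigned.  memset() before filling closes the leak.
         */
        #include <stdio.h>
        #include <string.h>

        struct legacy_semid_ds {        /* stand-in, not the real layout */
            long sem_otime;
            long sem_ctime;
            void *sem_base;             /* kernel pointers that must not leak */
            void *sem_pending;
            void *sem_pending_last;
            void *undo;
            unsigned short sem_nsems;
        };

        static void fill_unsafe(struct legacy_semid_ds *out)
        {
            /* BUG: the pointer fields keep whatever was on the stack. */
            out->sem_otime = 1;
            out->sem_ctime = 2;
            out->sem_nsems = 3;
        }

        static void fill_safe(struct legacy_semid_ds *out)
        {
            memset(out, 0, sizeof(*out));   /* zero everything first */
            out->sem_otime = 1;
            out->sem_ctime = 2;
            out->sem_nsems = 3;
        }

        int main(void)
        {
            struct legacy_semid_ds leaky, clean;

            fill_unsafe(&leaky);    /* unset fields still hold stack garbage */
            fill_safe(&clean);      /* every unset field is provably zero */
            printf("clean.sem_base=%p clean.undo=%p\n",
                   clean.sem_base, clean.undo);
            return 0;
        }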
     

21 Jul, 2010

1 commit

  • The last change to improve the scalability moved the actual wake-up out of
    the section that is protected by spin_lock(sma->sem_perm.lock).

    This means that IN_WAKEUP can be in queue.status even when the spinlock is
    acquired by the current task. Thus the same loop that is performed when
    queue.status is read without the spinlock acquired must be performed when
    the spinlock is acquired.

    Thanks to kamezawa.hiroyu@jp.fujitsu.com for noticing the lack of a
    memory barrier (an illustrative ordering sketch follows this entry).

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16255

    [akpm@linux-foundation.org: clean up kerneldoc, checkpatch warning and whitespace]
    Signed-off-by: Manfred Spraul
    Reported-by: Luca Tettamanti
    Tested-by: Luca Tettamanti
    Reported-by: Christoph Lameter
    Cc: Maciej Rutecki
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
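
    The ordering requirement can be illustrated with a small sketch. The
    helper name and sentinel value below are made up; in the kernel, the
    corresponding helper is get_queue_result() (see the 26 Jul, 2011 entry
    above), which busy-waits with cpu_relax(). The acquire load pairs with
    the waker's release store and plays the role of the memory barrier
    mentioned in the report:

        /*
         * Sketch: any reader of queue.status must spin while it still holds
         * the transient IN_WAKEUP value - even if the array lock is held -
         * because the waker publishes the final result outside the lock.
         */
        #include <sched.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define IN_WAKEUP (-100)        /* illustrative sentinel value */

        struct queue {
            atomic_int status;          /* IN_WAKEUP, or the final error code */
        };

        static int queue_result(struct queue *q)
        {
            int error;

            while ((error = atomic_load_explicit(&q->status,
                                memory_order_acquire)) == IN_WAKEUP)
                sched_yield();          /* kernel: busy-wait with cpu_relax() */

            return error;
        }

        int main(void)
        {
            struct queue q = { .status = IN_WAKEUP };

            /* ... a waker would eventually publish the final result: */
            atomic_store_explicit(&q.status, 0, memory_order_release);

            printf("result = %d\n", queue_result(&q));
            return 0;
        }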
     

28 May, 2010

4 commits

  • Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes the
    purpose of the operation clear, whereas the latter looks like a no-op (a
    simplified illustration of the error-pointer helpers follows this
    entry).

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    type T;
    T x;
    identifier f;
    @@

    T f (...) { }

    @@
    expression x;
    @@

    - ERR_PTR(PTR_ERR(x))
    + ERR_CAST(x)
    //

    Signed-off-by: Julia Lawall
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
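
    ERR_PTR(), PTR_ERR(), IS_ERR() and ERR_CAST() are the kernel's helpers
    for encoding an errno value inside a pointer. The simplified userspace
    re-implementations below are for illustration only (the real macros live
    in include/linux/err.h and differ in detail); they show why
    ERR_PTR(PTR_ERR(x)) is just a cast of an already-encoded error pointer,
    which ERR_CAST(x) spells out directly:

        /*
         * Simplified re-implementations, for illustration only.  An error
         * pointer is a pointer whose value is a small negative number
         * (-errno), so decode + re-encode is nothing but a cast.
         */
        #include <errno.h>
        #include <stdio.h>

        #define MAX_ERRNO 4095

        static inline void *ERR_PTR(long error)
        {
            return (void *)error;               /* encode -errno as a pointer */
        }

        static inline long PTR_ERR(const void *ptr)
        {
            return (long)ptr;                   /* decode it again */
        }

        static inline int IS_ERR(const void *ptr)
        {
            return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
        }

        static inline void *ERR_CAST(const void *ptr)
        {
            return (void *)ptr;                 /* same bits, new pointer type */
        }

        struct kern_ipc_perm;                   /* e.g. what a lookup returns */

        int main(void)
        {
            struct kern_ipc_perm *p = ERR_PTR(-EINVAL);

            if (IS_ERR(p)) {
                void *old_style = ERR_PTR(PTR_ERR(p)); /* decode + re-encode */
                void *new_style = ERR_CAST(p);         /* self-documenting cast */

                printf("%ld %ld\n", PTR_ERR(old_style), PTR_ERR(new_style));
            }
            return 0;
        }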
     
  • ipc/sem.c begins with a 15-year-old description of bugs in the initial
    Linux-1.0 implementation. The patch replaces that with a top-level
    description of the current code.

    A TODO could be derived from this text:

    The opengroup man page for semop() does not mandate FIFO. Thus there is
    no need for a semaphore array list of pending operations.

    If

    - this list is removed
    - the per-semaphore array spinlock is removed (possible if there is no
    list to protect)
    - sem_otime is moved into the semaphores and calculated on demand during
    semctl()

    then the array would be read-mostly - which would significantly improve
    scaling for applications that use semaphore arrays with lots of entries.

    The price would be expensive semctl() calls:

        for (i = 0; i < sma->sem_nsems; i++) spin_lock(sma->sem_lock);

        for (i = 0; i < sma->sem_nsems; i++) spin_unlock(sma->sem_lock);

    I'm not sure if the complexity is worth the effort, thus here is the
    documentation of the current behavior first.

    Signed-off-by: Manfred Spraul
    Cc: Chris Mason
    Cc: Zach Brown
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The wake-up part of semtimedop() consists of two steps:

    - the right tasks must be identified.
    - they must be woken up.

    Right now, both steps run while the array spinlock is held. This patch
    reorders the code and moves the actual wake_up_process() behind the
    point where the spinlock is dropped (a generic sketch of this pattern
    follows this entry).

    The code also moves the setting of sem->sem_otime to one place: it does
    not make sense to set the last-modified time multiple times.

    [akpm@linux-foundation.org: repair kerneldoc]
    [akpm@linux-foundation.org: fix uninitialised retval]
    Signed-off-by: Manfred Spraul
    Cc: Chris Mason
    Cc: Zach Brown
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
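
    The sketch below is generic (plain POSIX primitives, made-up names), not
    the kernel code: under the lock, the waiters that can complete are only
    collected onto a private list; the expensive wakeups happen after the
    lock has been dropped.

        /*
         * Step 1 (under the lock): identify/collect the waiters to wake.
         * Step 2 (lock dropped):   perform the actual wakeups.
         */
        #include <pthread.h>
        #include <sched.h>
        #include <semaphore.h>
        #include <stdio.h>

        struct waiter {
            sem_t wakeup;               /* each waiter sleeps on its own sem */
            struct waiter *next;
        };

        static pthread_mutex_t array_lock = PTHREAD_MUTEX_INITIALIZER;
        static struct waiter *pending;  /* list protected by array_lock */

        static void wake_pending_waiters(void)
        {
            struct waiter *todo, *w;

            pthread_mutex_lock(&array_lock);
            todo = pending;             /* grab the whole list ... */
            pending = NULL;
            pthread_mutex_unlock(&array_lock);

            for (w = todo; w; w = w->next)
                sem_post(&w->wakeup);   /* ... and wake outside the lock */
        }

        static void *waiter_fn(void *arg)
        {
            struct waiter *self = arg;

            sem_init(&self->wakeup, 0, 0);
            pthread_mutex_lock(&array_lock);
            self->next = pending;
            pending = self;
            pthread_mutex_unlock(&array_lock);

            sem_wait(&self->wakeup);    /* sleep until woken */
            return NULL;
        }

        int main(void)
        {
            pthread_t t[2];
            struct waiter w[2];

            pthread_create(&t[0], NULL, waiter_fn, &w[0]);
            pthread_create(&t[1], NULL, waiter_fn, &w[1]);

            for (;;) {                  /* wait until both have enqueued */
                int n = 0;
                struct waiter *it;

                pthread_mutex_lock(&array_lock);
                for (it = pending; it; it = it->next)
                    n++;
                pthread_mutex_unlock(&array_lock);
                if (n == 2)
                    break;
                sched_yield();
            }

            wake_pending_waiters();
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            puts("both waiters woken outside the lock");
            return 0;
        }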
     
  • The following series of patches tries to fix the spinlock contention
    reported by Chris Mason - his benchmark exposes problems of the current
    code:

    - In the worst case, the algorithm used by update_queue() is O(N^2).
    Bulk wake-up calls can enter this worst case. The patch series fixes
    that.

    Note that the benchmark app doesn't expose this problem; it should be
    fixed anyway, since real-world apps might do the wake-ups in an order
    other than perfect FIFO.

    - The part of the code that runs within the semaphore array spinlock is
    significantly larger than necessary.

    The patch series fixes that. This change is responsible for the main
    improvement.

    - The cacheline with the spinlock is also used for a variable that is
    read in the hot path (sem_base) and for a variable that is unnecessarily
    written to multiple times (sem_otime). The last step of the series
    cacheline-aligns the spinlock.

    This patch:

    The SysV semaphore code makes it possible to perform multiple operations
    on all semaphores in the array as one atomic operation. After a
    modification, update_queue() checks which of the waiting tasks can
    complete.

    The algorithm that is used to identify the tasks is O(N^2) in the worst
    case. For some cases, it is simple to avoid the O(N^2).

    The patch adds detection logic for some of these cases, especially for
    the case of an array where all sleeping tasks are single-sembuf
    operations and a multi-sembuf operation is used to wake up multiple
    tasks.

    A big database application uses that approach.

    The patch fixes the wakeup due to semctl(,,SETALL,) - the initial
    version of the patch broke that.

    [akpm@linux-foundation.org: make do_smart_update() static]
    Signed-off-by: Manfred Spraul
    Cc: Chris Mason
    Cc: Zach Brown
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

16 Dec, 2009

9 commits

  • This line is unreachable, remove it.

    [akpm@linux-foundation.org: remove unneeded initialisation of `err']
    Signed-off-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • If multiple simple decrements on the same semaphore are pending, then the
    current code scans all decrement operations, even if the semaphore value
    is already 0.

    The patch optimizes that: if the semaphore value is 0, then there is no
    need to scan the q->alter entries.

    Note that this is a common case: it happens if 100 decrements by one are
    pending and an increment by one now raises the semaphore value from 0 to
    1. Without this patch, all 100 entries are scanned. With the patch, only
    one entry is scanned and woken up; then the new rule triggers and the
    scan is aborted without looking at the remaining 99 tasks.

    With this patch, single-sop increments/decrements by 1 are now O(1),
    the same as with Nick's patch (a toy model of the early exit follows
    this entry).

    Signed-off-by: Manfred Spraul
    Cc: Nick Piggin
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
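
    In the toy model below, the function and data layout are invented for
    the example, not taken from the kernel:

        /*
         * Toy model: once the semaphore value has dropped to 0, no pending
         * decrement can succeed any more, so the scan stops early.
         */
        #include <stdio.h>

        struct pending_op {
            int delta;                  /* simple ops: -1 or +1 */
            int woken;
        };

        static int scan_pending(int *semval, struct pending_op *q, int n)
        {
            int i, scanned = 0;

            for (i = 0; i < n; i++) {
                if (q[i].woken)
                    continue;
                if (q[i].delta < 0 && *semval == 0)
                    break;              /* the optimization: stop scanning */
                scanned++;
                if (*semval + q[i].delta >= 0) {
                    *semval += q[i].delta;
                    q[i].woken = 1;     /* would wake the sleeping task */
                }
            }
            return scanned;
        }

        int main(void)
        {
            struct pending_op q[100];
            int i, semval = 1;          /* an increment just raised 0 -> 1 */

            for (i = 0; i < 100; i++)
                q[i] = (struct pending_op){ .delta = -1 };

            /* Only 1 of the 100 pending decrements is examined and woken. */
            printf("entries scanned: %d\n", scan_pending(&semval, q, 100));
            return 0;
        }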
     
  • sysv sem has the concept of semaphore arrays that consist of multiple
    semaphores. Atomic operations that affect multiple semaphores are
    supported.

    The patch optimizes single semaphore operation calls that affect only one
    semaphore: It's not necessary to scan all pending operations, it is
    sufficient to scan the per-semaphore list.

    The idea is from Nick Piggin's version of an ipc sem improvement; the
    implementation is different: the code tries to keep as much common code
    as possible.

    As a result, the patch is simpler, but optimizes fewer cases.

    Signed-off-by: Manfred Spraul
    Cc: Nick Piggin
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Based on Nick's findings:

    sysv sem has the concept of semaphore arrays that consist of multiple
    semaphores. Atomic operations that affect multiple semaphores are
    supported.

    The patch is the first step towards optimizing simple, single-semaphore
    operations: in addition to the global list of all pending operations, a
    second, per-semaphore list with the simple operations is added.

    Note: this patch does not make sense by itself, the new list is used
    nowhere.

    Signed-off-by: Manfred Spraul
    Cc: Nick Piggin
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Reduce the amount of scanning of the list of pending semaphore
    operations: if try_atomic_semop() failed, then no changes were applied,
    thus there is no need to restart.

    Additionally, this patch corrects an incorrect comment: it's possible to
    wait for arbitrary semaphore values (do a dec by x, wait-for-zero and an
    inc by x in one atomic operation).

    Both changes are from Nick Piggin; the patch is the result of a
    different split of the individual changes.

    Signed-off-by: Manfred Spraul
    Cc: Nick Piggin
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The strange sysv semaphore wakeup scheme has a kind of busy-wait lock
    involved, which could deadlock if preemption is enabled during the "lock".

    It is an implementation detail (due to a spinlock being held) that this is
    actually the case. However if "spinlocks" are made preemptible, or if the
    sem lock is changed to a sleeping lock for example, then the wakeup would
    become buggy. So this might be a bugfix for -rt kernels.

    Imagine the waker being preempted by the wakee and never clearing
    IN_WAKEUP -- if the wakee has a higher RT priority, there is a priority
    inversion deadlock. Even if there is no priority inversion to cause a
    deadlock, there is still time wasted spinning.

    Signed-off-by: Nick Piggin
    Signed-off-by: Manfred Spraul
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Replace the handcoded list operations in update_queue() with the
    standard list_for_each_entry macros.

    list_for_each_entry_safe() must be used, because list entries can
    disappear immediately upon the wakeup event (a self-contained
    illustration of the safe-iteration idea follows this entry).

    Signed-off-by: Nick Piggin
    Signed-off-by: Manfred Spraul
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
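
    The reason for the _safe variant can be shown without the kernel's
    struct list_head. The self-contained example below uses a plain singly
    linked list with invented names; the point is the same: the next pointer
    is saved before the current entry may be removed and freed, so the walk
    can continue afterwards.

        #include <stdio.h>
        #include <stdlib.h>

        struct entry {                  /* stand-in for a pending operation */
            int can_complete;
            struct entry *next;
        };

        static void wake_completable(struct entry **head)
        {
            struct entry *e = *head, *next, **prev = head;

            while (e) {
                next = e->next;         /* saved first: e may vanish below */
                if (e->can_complete) {
                    *prev = next;       /* unlink ... */
                    free(e);            /* ... and the entry is gone */
                } else {
                    prev = &e->next;
                }
                e = next;               /* still valid thanks to 'next' */
            }
        }

        int main(void)
        {
            struct entry *head = NULL, *e;
            int i, left = 0;

            for (i = 0; i < 5; i++) {
                e = malloc(sizeof(*e));
                e->can_complete = (i % 2 == 0);
                e->next = head;
                head = e;
            }

            wake_completable(&head);
            for (e = head; e; e = e->next)
                left++;
            printf("entries left: %d\n", left);     /* prints 2 */
            return 0;
        }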
     
  • Around a month ago, there was some discussion about an improvement of
    the sysv sem algorithm: most (at least some important) users only use
    simple semaphore operations, therefore it's worthwhile to optimize this
    use case.

    This patch:

    Move the last-looked-up sem_undo struct to the head of the task's undo
    list, attempting to keep common entries at the front of the list so that
    search time is reduced. This reduces lookup_undo() in an oprofile of the
    problematic SAP workload by 30% (see patch 4 for a description of the
    SAP workload). A minimal sketch of the move-to-front idea follows this
    entry.

    Signed-off-by: Nick Piggin
    Signed-off-by: Manfred Spraul
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
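
    The sketch below uses an invented list layout (the kernel operates on
    the task's real undo list), but the heuristic is the same: after a
    successful lookup the entry is moved to the head, so repeated lookups of
    the same hot entry stop walking the whole list.

        #include <stdio.h>
        #include <stdlib.h>

        struct sem_undo {
            int semid;                  /* key the lookup searches for */
            struct sem_undo *next;
        };

        static struct sem_undo *lookup_undo(struct sem_undo **head, int semid)
        {
            struct sem_undo **pp, *un;

            for (pp = head; (un = *pp) != NULL; pp = &un->next) {
                if (un->semid != semid)
                    continue;
                *pp = un->next;         /* unlink ... */
                un->next = *head;       /* ... and move to the front */
                *head = un;
                return un;
            }
            return NULL;
        }

        int main(void)
        {
            struct sem_undo *head = NULL, *un;
            int id;

            for (id = 0; id < 4; id++) {
                un = malloc(sizeof(*un));
                un->semid = id;
                un->next = head;
                head = un;
            }                           /* list order is now 3, 2, 1, 0 */

            lookup_undo(&head, 0);      /* the hot entry moves to the front */
            printf("head->semid = %d\n", head->semid);      /* prints 0 */
            return 0;
        }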
     
  • We have apparently had a memory leak since
    7ca7e564e049d8b350ec9d958ff25eaa24226352 "ipc: store ipcs into IDRs" in
    2007. The idrs, of which three exist for each ipc namespace, are never
    freed.

    This patch simply frees them when the ipcns is freed. I don't believe
    any idr_remove() calls are done from RCU (and could therefore be delayed
    until after this idr_destroy()), so the patch should be safe. Some quick
    testing showed no harm, and the memory leak is fixed.

    Caught by kmemleak.

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

15 Apr, 2009

1 commit


14 Jan, 2009

2 commits


07 Jan, 2009

1 commit


06 Jan, 2009

1 commit


17 Oct, 2008

1 commit


26 Jul, 2008

4 commits

  • The attached patch:

    - reverses the locking order of ulp->lock and sem_lock: previously it
      was ulp->lock first, then sem_lock inside it; now it is the other way
      around.
    - converts the undo structure to RCU.

    Benefits:

    - With the old locking order, IPC_RMID could not kfree the undo
      structures; the stale entries remained in the linked lists and were
      released later.
    - The patch fixes a race in semtimedop(): if both an IPC_RMID and a
      semget() that recreates exactly the same id happen between
      find_alloc_undo() and sem_lock, then semtimedop() would access
      already-kfree'd memory.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Reviewed-by: Nadia Derbey
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sem_array.sem_pending is a doubly linked list; the attached patch
    converts it to a struct list_head.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Reviewed-by: Nadia Derbey
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sem_queue.sma and sem_queue.id were never used; the attached patch
    removes them.

    Signed-off-by: Manfred Spraul
    Reviewed-by: Nadia Derbey
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The undo structures contain two linked lists; the attached patch
    replaces them with generic struct list_head lists.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Cc: Nadia Derbey
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

29 Apr, 2008

8 commits

  • sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly; this
    can cause kernel memory corruption. CLONE_NEWIPC must detach from the
    existing undo lists.

    Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)

    The original reason to not support it was the potential (inevitable?)
    confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
    inverse meaning of clone(CLONE_SYSVSEM).

    Our two most reasonable options then appear to be (1) fully support
    CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM, but
    always do it anyway on unshare(CLONE_SYSVSEM). This patch does (1); a
    minimal usage example of unshare(CLONE_SYSVSEM) follows this entry.

    Changelog:
    Apr 16: SEH: switch to Manfred's alternative patch which
    removes the unshare_semundo() function which
    always refused CLONE_SYSVSEM.

    Signed-off-by: Manfred Spraul
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
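
    For reference, a minimal userspace use of the new capability. On a
    kernel that includes this patch, the call below detaches the process
    from any semaphore undo list shared with CLONE_SYSVSEM siblings and
    gives it a fresh, private one (the inverse of passing CLONE_SYSVSEM to
    clone()):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
            if (unshare(CLONE_SYSVSEM) == -1) {
                perror("unshare(CLONE_SYSVSEM)");
                return 1;
            }
            puts("detached from the shared semaphore undo list");
            return 0;
        }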
     
  • semctl_down(), msgctl_down() and shmctl_down() are used to handle the
    same set of commands for each kind of IPC. They all start by doing the
    same job (they retrieve the ipc object and do some permission checks)
    before handling the commands on their own.

    This patch consolidates this by moving these common pieces of code into
    one function called ipcctl_pre_down().

    It simplifies the xxxctl_down() functions a little and slightly improves
    maintainability.

    Signed-off-by: Pierre Peiffer
    Acked-by: Serge Hallyn
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • The IPC_SET command performs the same permission setting for all IPCs. This
    patch introduces a common ipc_update_perm() function to update these
    permissions and makes use of it for all IPCs.

    Signed-off-by: Pierre Peiffer
    Acked-by: Serge Hallyn
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • All IPCs make use of an intermediate *_setbuf structure to handle the
    IPC_SET command. This is not really needed and, moreover, it complicates
    the code a little.

    This patch gets rid of it and uses the semid64_ds/msgid64_ds/shmid64_ds
    structures directly.

    In addition to removing one structure declaration, it also simplifies
    and improves the common 64-bit path a little.

    Signed-off-by: Pierre Peiffer
    Acked-by: Serge Hallyn
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • semctl_down() takes one unused parameter: semnum. This patch proposes to get
    rid of it.

    Signed-off-by: Pierre Peiffer
    Acked-by: Serge Hallyn
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • semctl_down() is called with the rwmutex (the one that protects the list
    of ipcs) taken in write mode.

    This patch moves the taking of this rwmutex in write mode inside
    semctl_down().

    This has the advantages of slightly reducing the window during which the
    rwmutex is held, of clarifying sys_semctl(), and finally of behaving
    coherently with [shm|msg]ctl_down.

    Signed-off-by: Pierre Peiffer
    Acked-by: Serge Hallyn
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Trivial patch that adds some small locking helper functions and uses
    them to factor out parts of the code and make it cleaner.

    Signed-off-by: Pierre Peiffer
    Acked-by: Serge Hallyn
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Continuing to consolidate the IPC code a little, each id can be built
    directly in ipc_addid() instead of being built by each caller of
    ipc_addid().

    shm_addid() is also removed in order to have, as much as possible, the
    same code for shm/sem/msg.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pierre Peiffer
    Cc: Nadia Derbey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer