12 Oct, 2016

2 commits

  • In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in
    exit_sem. Apparently it's possible for the loop to take quite a long time
    and it doesn't have a scheduling point in it. Since the codes is
    executing under an rcu read section this may also cause rcu stalls, which
    in turn block synchronize_rcu operations, which more or less de-stabilises
    the whole system.

    Fix this by introducing a cond_resched() at the beginning of the loop.

    So this patch fixes the following:

    NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
    CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
    Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
    task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
    RIP: 0010:[] [] _raw_spin_lock+0x17/0x30
    RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
    RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
    RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
    RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
    R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
    R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
    FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
    Call Trace:
    ? exit_sem+0x7c/0x280
    do_exit+0x338/0xb40
    do_group_exit+0x43/0xd0
    SyS_exit_group+0x14/0x20
    entry_SYSCALL_64_fastpath+0x16/0x6e

    Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
    Signed-off-by: Nikolay Borisov
    Cc: Herton R. Krzesinski
    Cc: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     
  • Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
    race:

    sem_lock has a fast path that allows parallel simple operations.
    There are two reasons why a simple operation cannot run in parallel:
    - a non-simple operations is ongoing (sma->sem_perm.lock held)
    - a complex operation is sleeping (sma->complex_count != 0)

    As both facts are stored independently, a thread can bypass the current
    checks by sleeping in the right positions. See below for more details
    (or kernel bugzilla 105651).

    The patch fixes that by creating one variable (complex_mode)
    that tracks both reasons why parallel operations are not possible.

    The patch also updates stale documentation regarding the locking.

    With regards to stable kernels:
    The patch is required for all kernels that include the
    commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)

    The alternative is to revert the patch that introduced the race.

    The patch is safe for backporting, i.e. it makes no assumptions
    about memory barriers in spin_unlock_wait().

    Background:
    Here is the race of the current implementation:

    Thread A: (simple op)
    - does the first "sma->complex_count == 0" test

    Thread B: (complex op)
    - does sem_lock(): This includes an array scan. But the scan can't
    find Thread A, because Thread A does not own sem->lock yet.
    - the thread does the operation, increases complex_count,
    drops sem_lock, sleeps

    Thread A:
    - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
    - sleeps before the complex_count test

    Thread C: (complex op)
    - does sem_lock (no array scan, complex_count==1)
    - wakes up Thread B.
    - decrements complex_count

    Thread A:
    - does the complex_count test

    Bug:
    Now both thread A and thread C operate on the same array, without
    any synchronization.

    Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
    Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
    Reported-by:
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

03 Aug, 2016

1 commit

  • Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
    to receive rcu freeing function but used generic ipc_rcu_free() instead
    of msg_rcu_free() which does security cleaning.

    Running LTP msgsnd06 with kmemleak gives the following:

    cat /sys/kernel/debug/kmemleak

    unreferenced object 0xffff88003c0a11f8 (size 8):
    comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
    hex dump (first 8 bytes):
    1b 00 00 00 01 00 00 00 ........
    backtrace:
    kmemleak_alloc+0x23/0x40
    kmem_cache_alloc_trace+0xe1/0x180
    selinux_msg_queue_alloc_security+0x3f/0xd0
    security_msg_queue_alloc+0x2e/0x40
    newque+0x4e/0x150
    ipcget+0x159/0x1b0
    SyS_msgget+0x39/0x40
    entry_SYSCALL_64_fastpath+0x13/0x8f

    Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to
    only use ipc_rcu_free in case of security allocation failure in newary()

    Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
    Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

14 Jun, 2016

2 commits

  • With the modified semantics of spin_unlock_wait() a number of
    explicit barriers can be removed. Also update the comment for the
    do_exit() usecase, as that was somewhat stale/obscure.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce smp_acquire__after_ctrl_dep(), this construct is not
    uncommon, but the lack of this barrier is.

    Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
    semaphore code and the qspinlock code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Mar, 2016

1 commit

  • As indicated by bug#112271, Linux sets the sempid value upon semctl, and
    not only for semop calls. However, within semctl we only do this for
    SETVAL, leaving SETALL without updating the field, and therefore rather
    inconsistent behavior when compared to other Unices.

    There is really no documentation regarding this and therefore users
    should not make assumptions. With this patch, along with updating
    semctl.2 manpages, this scenario should become less ambiguous As such,
    set sempid on SETALL cmd.

    Also update some in-code documentation, specifying where the sempid is
    set.

    Passes ltp and custom testcase where a child (fork) does SETALL to the
    set.

    Signed-off-by: Davidlohr Bueso
    Reported-by: Philip Semanchuk
    Cc: Michael Kerrisk
    Cc: PrasannaKumar Muralidharan
    Cc: Manfred Spraul
    Cc: Herton R. Krzesinski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

23 Jan, 2016

1 commit

  • There are many locations that do

    if (memory_was_allocated_by_vmalloc)
    vfree(ptr);
    else
    kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you found
    problems.

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

15 Aug, 2015

3 commits

  • sem_lock() did not properly pair memory barriers:

    !spin_is_locked() and spin_unlock_wait() are both only control barriers.
    The code needs an acquire barrier, otherwise the cpu might perform read
    operations before the lock test.

    As no primitive exists inside and since it seems
    noone wants another primitive, the code creates a local primitive within
    ipc/sem.c.

    With regards to -stable:

    The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
    nop (just a preprocessor redefinition to improve the readability). The
    bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
    starting from 3.10).

    Signed-off-by: Manfred Spraul
    Reported-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: "Paul E. McKenney"
    Cc: Kirill Tkhai
    Cc: Ingo Molnar
    Cc: Josh Poimboeuf
    Cc: Davidlohr Bueso
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • After we acquire the sma->sem_perm lock in exit_sem(), we are protected
    against a racing IPC_RMID operation. Also at that point, we are the last
    user of sem_undo_list. Therefore it isn't required that we acquire or use
    ulp->lock.

    Signed-off-by: Herton R. Krzesinski
    Acked-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    CC: Aristeu Rozanski
    Cc: David Jeffery
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Herton R. Krzesinski
     
  • The current semaphore code allows a potential use after free: in
    exit_sem we may free the task's sem_undo_list while there is still
    another task looping through the same semaphore set and cleaning the
    sem_undo list at freeary function (the task called IPC_RMID for the same
    semaphore set).

    For example, with a test program [1] running which keeps forking a lot
    of processes (which then do a semop call with SEM_UNDO flag), and with
    the parent right after removing the semaphore set with IPC_RMID, and a
    kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
    CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
    in the kernel log:

    Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
    000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
    010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
    Prev obj: start=ffff88003b45c180, len=64
    000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
    010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
    Next obj: start=ffff88003b45c200, len=64
    000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
    010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h). 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
    RIP [] spin_dump+0x53/0xc0
    RSP
    ---[ end trace 783ebb76612867a0 ]---
    NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
    Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
    CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    RIP: native_read_tsc+0x0/0x20
    Call Trace:
    ? delay_tsc+0x40/0x70
    __delay+0xf/0x20
    do_raw_spin_lock+0x96/0x140
    _raw_spin_lock+0xe/0x10
    sem_lock_and_putref+0x11/0x70
    SYSC_semtimedop+0x7bf/0x960
    ? handle_mm_fault+0xbf6/0x1880
    ? dequeue_task_fair+0x79/0x4a0
    ? __do_page_fault+0x19a/0x430
    ? kfree_debugcheck+0x16/0x40
    ? __do_page_fault+0x19a/0x430
    ? __audit_syscall_entry+0xa8/0x100
    ? do_audit_syscall_entry+0x66/0x70
    ? syscall_trace_enter_phase1+0x139/0x160
    SyS_semtimedop+0xe/0x10
    SyS_semop+0x10/0x20
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
    Kernel panic - not syncing: softlockup: hung tasks

    I wasn't able to trigger any badness on a recent kernel without the
    proper config debugs enabled, however I have softlockup reports on some
    kernel versions, in the semaphore code, which are similar as above (the
    scenario is seen on some servers running IBM DB2 which uses semaphore
    syscalls).

    The patch here fixes the race against freeary, by acquiring or waiting
    on the sem_undo_list lock as necessary (exit_sem can race with freeary,
    while freeary sets un->semid to -1 and removes the same sem_undo from
    list_proc or when it removes the last sem_undo).

    After the patch I'm unable to reproduce the problem using the test case
    [1].

    [1] Test case used below:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define NSEM 1
    #define NSET 5

    int sid[NSET];

    void thread()
    {
    struct sembuf op;
    int s;
    uid_t pid = getuid();

    s = rand() % NSET;
    op.sem_num = pid % NSEM;
    op.sem_op = 1;
    op.sem_flg = SEM_UNDO;

    semop(sid[s], &op, 1);
    exit(EXIT_SUCCESS);
    }

    void create_set()
    {
    int i, j;
    pid_t p;
    union {
    int val;
    struct semid_ds *buf;
    unsigned short int *array;
    struct seminfo *__buf;
    } un;

    /* Create and initialize semaphore set */
    for (i = 0; i < NSET; i++) {
    sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
    if (sid[i] < 0) {
    perror("semget");
    exit(EXIT_FAILURE);
    }
    }
    un.val = 0;
    for (i = 0; i < NSET; i++) {
    for (j = 0; j < NSEM; j++) {
    if (semctl(sid[i], j, SETVAL, un) < 0)
    perror("semctl");
    }
    }

    /* Launch threads that operate on semaphore set */
    for (i = 0; i < NSEM * NSET * NSET; i++) {
    p = fork();
    if (p < 0)
    perror("fork");
    if (p == 0)
    thread();
    }

    /* Free semaphore set */
    for (i = 0; i < NSET; i++) {
    if (semctl(sid[i], NSEM, IPC_RMID))
    perror("IPC_RMID");
    }

    /* Wait for forked processes to exit */
    while (wait(NULL)) {
    if (errno == ECHILD)
    break;
    };
    }

    int main(int argc, char **argv)
    {
    pid_t p;

    srand(time(NULL));

    while (1) {
    p = fork();
    if (p < 0) {
    perror("fork");
    exit(EXIT_FAILURE);
    }
    if (p == 0) {
    create_set();
    goto end;
    }

    /* Wait for forked processes to exit */
    while (wait(NULL)) {
    if (errno == ECHILD)
    break;
    };
    }
    end:
    return 0;
    }

    [akpm@linux-foundation.org: use normal comment layout]
    Signed-off-by: Herton R. Krzesinski
    Acked-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    CC: Aristeu Rozanski
    Cc: David Jeffery
    Cc:
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Herton R. Krzesinski
     

01 Jul, 2015

1 commit


16 Apr, 2015

1 commit

  • The seq_printf return value, because it's frequently misused,
    will eventually be converted to void.

    See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
    seq_has_overflowed() and make public")

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

18 Feb, 2015

1 commit

  • Call __set_current_state() instead of assigning the new state directly.
    These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping
    track of who changed the state.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

14 Dec, 2014

1 commit

  • When I fixed bugs in the sem_lock() logic, I was more conservative than
    necessary. Therefore it is safe to replace the smp_mb() with smp_rmb().
    And: With smp_rmb(), semop() syscalls are up to 10% faster.

    The race we must protect against is:

    sem->lock is free
    sma->complex_count = 0
    sma->sem_perm.lock held by thread B

    thread A:

    A: spin_lock(&sem->lock)

    B: sma->complex_count++; (now 1)
    B: spin_unlock(&sma->sem_perm.lock);

    A: spin_is_locked(&sma->sem_perm.lock);
    A: XXXXX memory barrier
    A: if (sma->complex_count == 0)

    Thread A must read the increased complex_count value, i.e. the read must
    not be reordered with the read of sem_perm.lock done by spin_is_locked().

    Since it's about ordering of reads, smp_rmb() is sufficient.

    [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

04 Dec, 2014

1 commit

  • ipc_addid() makes a new ipc identifier visible to everyone. New objects
    start as locked, so that the caller can complete the initialization
    after the call. Within struct sem_array, at least sma->sem_base and
    sma->sem_nsems are accessed without any locks, therefore this approach
    doesn't work.

    Thus: Move the ipc_addid() to the end of the initialization.

    Signed-off-by: Manfred Spraul
    Reported-by: Rik van Riel
    Acked-by: Rik van Riel
    Acked-by: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

07 Jun, 2014

9 commits

  • The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
    always (since 0.99.10) reported a thread as sleeping on all semaphores
    that are listed in the semop() call.

    The documented behavior (both in the Linux man page and in the Single
    Unix Specification) is that a task should be reported on exactly one
    semaphore: The semaphore that caused the thread to got to sleep.

    This patch adds a pr_info_once() that is triggered if a thread hits the
    relevant case.

    The code triggers slightly too often, otherwise it would be necessary to
    replicate the old code. As there are no known users of GETNCNT or
    GETZCNT, this is done to prevent unnecessary bloat.

    The task that triggered is reported with name (tsk->comm) and pid.

    Signed-off-by: Manfred Spraul
    Acked-by: Davidlohr Bueso
    Cc: Michael Kerrisk
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
    waits on exactly one semaphore: The semaphore from the first operation
    in the sop array that cannot proceed.

    The Linux implementation never followed the standard, it tried to count
    all semaphores that might be the reason why a task sleeps.

    This patch fixes that.

    Note:
    a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
    therefore the code counts them only on demand.
    (If they wouldn't be rare, then the non-compliance would have
    been found earlier)

    b) compared to the initial version of the patch, the BUG_ONs were removed
    and it was clarified that the new behavior conforms to SUS.

    Back-compatibility concerns:

    Manfred:

    : - there is no application in Fedora that uses GETNCNT or GETZCNT.
    :
    : - application that use only single-sop semop() are also safe, the
    : difference only affects complex apps.
    :
    : - portable application are also safe, the new behavior is standard
    : compliant.
    :
    : But that's it. The old behavior existed in Linux from 0.99.something
    : until now.

    Michael:

    : * These operations seem to be very little used. Grepping the public
    : source that is contained Fedora 20 source DVD, there appear to be no
    : uses. Of course, this says nothing about uses in private /
    : non-mainstream FOSS code, but it seems likely that the same pattern
    : is followed there.
    :
    : * The existing behavior is hard enough to understand that I suspect
    : that no one understood it well enough to rely on it anyway
    : (especially as that behavior contradicted both man page and POSIX).
    :
    : So, there's a chance of breakage, but I estimate that it's minute.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Preparation for the next patch:

    In the slow-path of perform_atomic_semop(), store a pointer to the
    operation that caused the operation to block.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Right now, perform_atomic_semop gets the content of sem_queue as
    individual fields. Changes that, instead pass a pointer to sem_queue.

    This is a preparation for the next patch: it uses sem_queue to store the
    reason why a task must sleep.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • count_semzcnt and count_semncnt are more of less identical. The patch
    creates a single function that either counts the number of tasks waiting
    for zero or waiting due to a decrease operation.

    Compared to the initial version, the BUG_ONs were removed.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • GETZCNT is supposed to return the number of threads that wait until a
    semaphore value becomes 0.

    The current implementation overlooks complex operations that contain
    both wait-for-zero operation and operations that alter at least one
    semaphore.

    The patch fixes that. It's intentionally copy&paste, this will be
    cleaned up in the next patch.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • trailing whitespace

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • Use #include instead of
    Use #include instead of

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • There is no need to recreate the very same ipc_ops structure on every
    kernel entry for msgget/semget/shmget. Just declare it static and be
    done with it. While at it, constify it as we don't modify the structure
    at runtime.

    Found in the PaX patch, written by the PaX Team.

    Signed-off-by: Mathias Krause
    Cc: PaX Team
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     

28 Jan, 2014

6 commits

  • Deal with checkpatch messages:
    WARNING: braces {} are not necessary for single statement blocks

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • IPC commenting style is all over the place, *specially* in util.c. This
    patch orders things a bit.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ipc code does not adhere the typical linux coding style.
    This patch fixes lots of simple whitespace errors.

    - mostly autogenerated by
    scripts/checkpatch.pl -f --fix \
    --types=pointer_location,spacing,space_before_tab
    - one manual fixup (keep structure members tab-aligned)
    - removal of additional space_before_tab that were not found by --fix

    Tested with some of my msg and sem test apps.

    Andrew: Could you include it in -mm and move it towards Linus' tree?

    Signed-off-by: Manfred Spraul
    Suggested-by: Li Bin
    Cc: Joe Perches
    Acked-by: Rafael Aquini
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • struct kern_ipc_perm.deleted is meant to be used as a boolean toggle, and
    the changes introduced by this patch are just to make the case explicit.

    Signed-off-by: Rafael Aquini
    Reviewed-by: Rik van Riel
    Cc: Greg Thelen
    Acked-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • After the locking semantics for the SysV IPC API got improved, a couple
    of IPC_RMID race windows were opened because we ended up dropping the
    'kern_ipc_perm.deleted' check performed way down in ipc_lock(). The
    spotted races got sorted out by re-introducing the old test within the
    racy critical sections.

    This patch introduces ipc_valid_object() to consolidate the way we cope
    with IPC_RMID races by using the same abstraction across the API
    implementation.

    Signed-off-by: Rafael Aquini
    Acked-by: Rik van Riel
    Acked-by: Greg Thelen
    Reviewed-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • When trying to understand semop code, I found a small mistake in the check
    for semadj (undo) value overflow. The new undo value is not stored
    immediately and next potential checks are done against the old value.

    The failing scenario is not much practical. One semop call has to do more
    operations on the same semaphore. Also semval and semadj must have
    different values, so there has to be some operations without SEM_UNDO
    flag. For example:

    struct sembuf depositor_op[1];
    struct sembuf collector_op[2];

    depositor_op[0].sem_num = 0;
    depositor_op[0].sem_op = 20000;
    depositor_op[0].sem_flg = 0;

    collector_op[0].sem_num = 0;
    collector_op[0].sem_op = -10000;
    collector_op[0].sem_flg = SEM_UNDO;
    collector_op[1].sem_num = 0;
    collector_op[1].sem_op = -10000;
    collector_op[1].sem_flg = SEM_UNDO;

    if (semop(semid, depositor_op, 1) == -1)
    { perror("Failed to do 1st deposit"); return 1; }

    if (semop(semid, collector_op, 2) == -1)
    { perror("Failed to do 1st collect"); return 1; }

    if (semop(semid, depositor_op, 1) == -1)
    { perror("Failed to do 2nd deposit"); return 1; }

    if (semop(semid, collector_op, 2) == -1)
    { perror("Failed to do 2nd collect"); return 1; }

    return 0;

    It passes without error now but the semadj value has overflown in the 2nd
    collector operation.

    [akpm@linux-foundation.org: restore lessened scope of local `undo']
    [davidlohr@hp.com: correct header comment for perform_atomic_semop]
    Signed-off-by: Petr Mladek
    Acked-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Cc: Jiri Kosina
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

17 Oct, 2013

1 commit

  • After acquiring the semlock spinlock, operations must test that the
    array is still valid.

    - semctl() and exit_sem() would walk stale linked lists (ugly, but
    should be ok: all lists are empty)

    - semtimedop() would sleep forever - and if woken up due to a signal -
    access memory after free.

    The patch also:
    - standardizes the tests for .deleted, so that all tests in one
    function leave the function with the same approach.
    - unconditionally tests for .deleted immediately after every call to
    sem_lock - even it it means that for semctl(GETALL), .deleted will be
    tested twice.

    Both changes make the review simpler: After every sem_lock, there must
    be a test of .deleted, followed by a goto to the cleanup code (if the
    function uses "goto cleanup").

    The only exception is semctl_down(): If sem_ids().rwsem is locked, then
    the presence in ids->ipcs_idr is equivalent to !.deleted, thus no
    additional test is required.

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

01 Oct, 2013

4 commits

  • In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the
    spinlock section"), the update of semaphore's sem_otime(last semop time)
    was moved to one central position (do_smart_update).

    But since do_smart_update() is only called for operations that modify
    the array, this means that wait-for-zero semops do not update sem_otime
    anymore.

    The fix is simple:
    Non-alter operations must update sem_otime.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Reported-by: Jia He
    Tested-by: Jia He
    Cc: Davidlohr Bueso
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The proc interface is not aware of sem_lock(), it instead calls
    ipc_lock_object() directly. This means that simple semop() operations
    can run in parallel with the proc interface. Right now, this is
    uncritical, because the implementation doesn't do anything that requires
    a proper synchronization.

    But it is dangerous and therefore should be fixed.

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Operations that need access to the whole array must guarantee that there
    are no simple operations ongoing. Right now this is achieved by
    spin_unlock_wait(sem->lock) on all semaphores.

    If complex_count is nonzero, then this spin_unlock_wait() is not
    necessary, because it was already performed in the past by the thread
    that increased complex_count and even though sem_perm.lock was dropped
    inbetween, no simple operation could have started, because simple
    operations cannot start when complex_count is non-zero.

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The exclusion of complex operations in sem_lock() is insufficient: after
    acquiring the per-semaphore lock, a simple op must first check that
    sem_perm.lock is not locked and only after that test check
    complex_count. The current code does it the other way around - and that
    creates a race. Details are below.

    The patch is a complete rewrite of sem_lock(), based in part on the code
    from Mike Galbraith. It removes all gotos and all loops and thus the
    risk of livelocks.

    I have tested the patch (together with the next one) on my i3 laptop and
    it didn't cause any problems.

    The bug is probably also present in 3.10 and 3.11, but for these kernels
    it might be simpler just to move the test of sma->complex_count after
    the spin_is_locked() test.

    Details of the bug:

    Assume:
    - sma->complex_count = 0.
    - Thread 1: semtimedop(complex op that must sleep)
    - Thread 2: semtimedop(simple op).

    Pseudo-Trace:

    Thread 1: sem_lock(): acquire sem_perm.lock
    Thread 1: sem_lock(): check for ongoing simple ops
    Nothing ongoing, thread 2 is still before sem_lock().
    Thread 1: try_atomic_semop()
    <<< preempted.

    Thread 2: sem_lock():
    static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
    int nsops)
    {
    int locknum;
    again:
    if (nsops == 1 && !sma->complex_count) {
    struct sem *sem = sma->sem_base + sops->sem_num;

    /* Lock just the semaphore we are interested in. */
    spin_lock(&sem->lock);

    /*
    * If sma->complex_count was set while we were spinning,
    * we may need to look at things we did not lock here.
    */
    if (unlikely(sma->complex_count)) {
    spin_unlock(&sem->lock);
    goto lock_array;
    }
    <<<<<<<<<
    <<< complex_count is still 0.
    <<<
    <<< Here it is preempted
    <<<<<<<<<

    Thread 1: try_atomic_semop() returns, notices that it must sleep.
    Thread 1: increases sma->complex_count.
    Thread 1: drops sem_perm.lock
    Thread 2:
    /*
    * Another process is holding the global lock on the
    * sem_array; we cannot enter our critical section,
    * but have to wait for the global lock to be released.
    */
    if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
    spin_unlock(&sem->lock);
    spin_unlock_wait(&sma->sem_perm.lock);
    goto again;
    }
    <<< sem_perm.lock already dropped, thus no "goto again;"

    locknum = sops->sem_num;

    Signed-off-by: Manfred Spraul
    Cc: Mike Galbraith
    Cc: Rik van Riel
    Cc: Davidlohr Bueso
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

25 Sep, 2013

1 commit

  • Currently, IPC mechanisms do security and auditing related checks under
    RCU. However, since security modules can free the security structure,
    for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
    race if the structure is freed before other tasks are done with it,
    creating a use-after-free condition. Manfred illustrates this nicely,
    for instance with shared mem and selinux:

    -> do_shmat calls rcu_read_lock()
    -> do_shmat calls shm_object_check().
    Checks that the object is still valid - but doesn't acquire any locks.
    Then it returns.
    -> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
    -> selinux_shm_shmat calls ipc_has_perm()
    -> ipc_has_perm accesses ipc_perms->security

    shm_close()
    -> shm_close acquires rw_mutex & shm_lock
    -> shm_close calls shm_destroy
    -> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
    -> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
    -> ipc_free_security calls kfree(ipc_perms->security)

    This patch delays the freeing of the security structures after all RCU
    readers are done. Furthermore it aligns the security life cycle with
    that of the rest of IPC - freeing them based on the reference counter.
    For situations where we need not free security, the current behavior is
    kept. Linus states:

    "... the old behavior was suspect for another reason too: having the
    security blob go away from under a user sounds like it could cause
    various other problems anyway, so I think the old code was at least
    _prone_ to bugs even if it didn't have catastrophic behavior."

    I have tested this patch with IPC testcases from LTP on both my
    quad-core laptop and on a 64 core NUMA server. In both cases selinux is
    enabled, and tests pass for both voluntary and forced preemption models.
    While the mentioned races are theoretical (at least no one as reported
    them), I wanted to make sure that this new logic doesn't break anything
    we weren't aware of.

    Suggested-by: Linus Torvalds
    Signed-off-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Sep, 2013

1 commit

  • Since in some situations the lock can be shared for readers, we shouldn't
    be calling it a mutex, rename it to rwsem.

    Signed-off-by: Davidlohr Bueso
    Tested-by: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

10 Jul, 2013

3 commits

  • Cleanup: Some minor points that I noticed while writing the previous
    patches

    1) The name try_atomic_semop() is misleading: The function performs the
    operation (if it is possible).

    2) Some documentation updates.

    No real code change, a rename and documentation changes.

    Signed-off-by: Manfred Spraul
    Cc: Rik van Riel
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sem_otime contains the time of the last semaphore operation that
    completed successfully. Every operation updates this value, thus access
    from multiple cpus can cause thrashing.

    Therefore the patch replaces the variable with a per-semaphore variable.
    The per-array sem_otime is only calculated when required.

    No performance improvement on a single-socket i3 - only important for
    larger systems.

    Signed-off-by: Manfred Spraul
    Cc: Rik van Riel
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • There are two places that can contain alter operations:
    - the global queue: sma->pending_alter
    - the per-semaphore queues: sma->sem_base[].pending_alter.

    Since one of the queues must be processed first, this causes an odd
    priorization of the wakeups: complex operations have priority over
    simple ops.

    The patch restores the behavior of linux pending_alter is used.
    - otherwise, the per-semaphore queues are used.

    As a side effect, do_smart_update_queue() becomes much simpler: no more
    goto logic.

    Signed-off-by: Manfred Spraul
    Cc: Rik van Riel
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul