15 May, 2020

1 commit

  • Commit 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase
    position index") is causing this bug (seen on 5.6.8):

    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages

    # ipcmk -Q
    Message queue id: 0
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x82db8127 0 root 644 0 0

    # ipcmk -Q
    Message queue id: 1
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x82db8127 0 root 644 0 0
    0x76d1fb2a 1 root 644 0 0

    # ipcrm -q 0
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x76d1fb2a 1 root 644 0 0
    0x76d1fb2a 1 root 644 0 0

    # ipcmk -Q
    Message queue id: 2
    # ipcrm -q 2
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x76d1fb2a 1 root 644 0 0
    0x76d1fb2a 1 root 644 0 0

    # ipcmk -Q
    Message queue id: 3
    # ipcrm -q 1
    # ipcs -q

    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0x7c982867 3 root 644 0 0
    0x7c982867 3 root 644 0 0
    0x7c982867 3 root 644 0 0
    0x7c982867 3 root 644 0 0

    Whenever an IPC item with a low id is deleted, the items with higher ids
    are duplicated, as if filling a hole.

    new_pos should jump through hole of unused ids, pos can be updated
    inside "for" cycle.

    Fixes: 89163f93c6f9 ("ipc/util.c: sysvipc_find_ipc() should increase position index")
    Reported-by: Andreas Schwab
    Reported-by: Randy Dunlap
    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Acked-by: Waiman Long
    Cc: NeilBrown
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Peter Oberparleiter
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc:
    Link: http://lkml.kernel.org/r/4921fe9b-9385-a2b4-1dc4-1099be6d2e39@virtuozzo.com
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

11 Apr, 2020

1 commit

  • If seq_file .next function does not change position index, read after
    some lseek can generate unexpected output.

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Acked-by: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Al Viro
    Cc: Ingo Molnar
    Cc: NeilBrown
    Cc: Peter Oberparleiter
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/b7a20945-e315-8bb0-21e6-3875c14a8494@virtuozzo.com
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

08 Apr, 2020

1 commit

  • Now that "struct proc_ops" exist we can start putting there stuff which
    could not fly with VFS "struct file_operations"...

    Most of fs/proc/inode.c file is dedicated to make open/read/.../close
    reliable in the event of disappearing /proc entries which usually happens
    if module is getting removed. Files like /proc/cpuinfo which never
    disappear simply do not need such protection.

    Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
    "permanent" files.

    Enable "permanent" flag for

    /proc/cpuinfo
    /proc/kmsg
    /proc/modules
    /proc/slabinfo
    /proc/stat
    /proc/sysvipc/*
    /proc/swaps

    More will come once I figure out foolproof way to prevent out module
    authors from marking their stuff "permanent" for performance reasons
    when it is not.

    This should help with scalability: benchmark is "read /proc/cpuinfo R times
    by N threads scattered over the system".

    N R t, s (before) t, s (after)
    -----------------------------------------------------
    64 4096 1.582458 1.530502 -3.2%
    256 4096 6.371926 6.125168 -3.9%
    1024 4096 25.64888 24.47528 -4.6%

    Benchmark source:

    #include
    #include
    #include
    #include

    #include
    #include
    #include
    #include

    const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
    int N;
    const char *filename;
    int R;

    int xxx = 0;

    int glue(int n)
    {
    cpu_set_t m;
    CPU_ZERO(&m);
    CPU_SET(n, &m);
    return sched_setaffinity(0, sizeof(cpu_set_t), &m);
    }

    void f(int n)
    {
    glue(n % NR_CPUS);

    while (*(volatile int *)&xxx == 0) {
    }

    for (int i = 0; i < R; i++) {
    int fd = open(filename, O_RDONLY);
    char buf[4096];
    ssize_t rv = read(fd, buf, sizeof(buf));
    asm volatile ("" :: "g" (rv));
    close(fd);
    }
    }

    int main(int argc, char *argv[])
    {
    if (argc < 4) {
    std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
    ";
    return 1;
    }

    N = atoi(argv[1]);
    filename = argv[2];
    R = atoi(argv[3]);

    for (int i = 0; i < NR_CPUS; i++) {
    if (glue(i) == 0)
    break;
    }

    std::vector T;
    T.reserve(N);
    for (int i = 0; i < N; i++) {
    T.emplace_back(f, i);
    }

    auto t0 = std::chrono::system_clock::now();
    {
    *(volatile int *)&xxx = 1;
    for (auto& t: T) {
    t.join();
    }
    }
    auto t1 = std::chrono::system_clock::now();
    std::chrono::duration dt = t1 - t0;
    std::cout << dt.count() << '
    ';

    return 0;
    }

    P.S.:
    Explicit randomization marker is added because adding non-function pointer
    will silently disable structure layout randomization.

    [akpm@linux-foundation.org: coding style fixes]
    Reported-by: kbuild test robot
    Reported-by: Dan Carpenter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Joe Perches
    Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

04 Feb, 2020

1 commit

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

10 Dec, 2019

1 commit

  • Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
    at places where these are defined. Later patches will remove the unused
    definition of FIELD_SIZEOF().

    This patch is generated using following script:

    EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

    git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
    do

    if [[ "$file" =~ $EXCLUDE_FILES ]]; then
    continue
    fi
    sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
    done

    Signed-off-by: Pankaj Bharadiya
    Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook
    Acked-by: David Miller # for net

    Pankaj Bharadiya
     

15 May, 2019

3 commits

  • For ipcmni_extend mode, the sequence number space is only 7 bits. So
    the chance of id reuse is relatively high compared with the non-extended
    mode.

    To alleviate this id reuse problem, this patch enables cyclic allocation
    for the index to the radix tree (idx). The disadvantage is that this
    can cause a slight slow-down of the fast path, as the radix tree could
    be higher than necessary.

    To limit the radix tree height, I have chosen the following limits:
    1) The cycling is done over in_use*1.5.
    2) At least, the cycling is done over
    "normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
    "ipcmni_extended": 4096 elements

    Result:
    - for normal mode:
    No change for 4095 active objects until the 3rd level
    is added without cyclic allocation.

    For a 2-level radix tree compared to a 1-level radix tree, I have
    observed < 1% performance impact.

    Notes:
    1) Normal "x=semget();y=semget();" is unaffected: Then the idx
    is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
    is used.

    2) The -1% happens in a microbenchmark after this situation:
    x=semget();
    for(i=0;i<
    Acked-by: Waiman Long
    Cc: "Luis R. Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: "Eric W . Biederman"
    Cc: Takashi Iwai
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Rewrite, based on the patch from Waiman Long:

    The mixing in of a sequence number into the IPC IDs is probably to avoid
    ID reuse in userspace as much as possible. With ipcmni_extend mode, the
    number of usable sequence numbers is greatly reduced leading to higher
    chance of ID reuse.

    To address this issue, we need to conserve the sequence number space as
    much as possible. Right now, the sequence number is incremented for
    every new ID created. In reality, we only need to increment the
    sequence number when new allocated ID is not greater than the last one
    allocated. It is in such case that the new ID may collide with an
    existing one. This is being done irrespective of the ipcmni mode.

    In order to avoid any races, the index is first allocated and then the
    pointer is replaced.

    Changes compared to the initial patch:
    - Handle failures from idr_alloc().
    - Avoid that concurrent operations can see the wrong sequence number.
    (This is achieved by using idr_replace()).
    - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
    ipcmni_seq_shift().
    - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

    Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
    Signed-off-by: Manfred Spraul
    Signed-off-by: Waiman Long
    Suggested-by: Matthew Wilcox
    Acked-by: Waiman Long
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: "Eric W . Biederman"
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: "Luis R. Rodriguez"
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The maximum number of unique System V IPC identifiers was limited to
    32k. That limit should be big enough for most use cases.

    However, there are some users out there requesting for more, especially
    those that are migrating from Solaris which uses 24 bits for unique
    identifiers. To satisfy the need of those users, a new boot time kernel
    option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This
    is a 512X increase which should be big enough for users out there that
    need a large number of unique IPC identifier.

    The use of this new option will change the pattern of the IPC
    identifiers returned by functions like shmget(2). An application that
    depends on such pattern may not work properly. So it should only be
    used if the users really need more than 32k of unique IPC numbers.

    This new option does have the side effect of reducing the maximum number
    of unique sequence numbers from 64k down to 128. So it is a trade-off.

    The computation of a new IPC id is not done in the performance critical
    path. So a little bit of additional overhead shouldn't have any real
    performance impact.

    Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Manfred Spraul
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: "Eric W . Biederman"
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: "Luis R. Rodriguez"
    Cc: Matthew Wilcox
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

08 Apr, 2019

1 commit

  • This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
    bucket pointer to lock the hash chain for that bucket.

    The benefits of a bit spin_lock are:
    - no need to allocate a separate array of locks.
    - no need to have a configuration option to guide the
    choice of the size of this array
    - locking cost is often a single test-and-set in a cache line
    that will have to be loaded anyway. When inserting at, or removing
    from, the head of the chain, the unlock is free - writing the new
    address in the bucket head implicitly clears the lock bit.
    For __rhashtable_insert_fast() we ensure this always happens
    when adding a new key.
    - even when lockings costs 2 updates (lock and unlock), they are
    in a cacheline that needs to be read anyway.

    The cost of using a bit spin_lock is a little bit of code complexity,
    which I think is quite manageable.

    Bit spin_locks are sometimes inappropriate because they are not fair -
    if multiple CPUs repeatedly contend of the same lock, one CPU can
    easily be starved. This is not a credible situation with rhashtable.
    Multiple CPUs may want to repeatedly add or remove objects, but they
    will typically do so at different buckets, so they will attempt to
    acquire different locks.

    As we have more bit-locks than we previously had spinlocks (by at
    least a factor of two) we can expect slightly less contention to
    go with the slightly better cache behavior and reduced memory
    consumption.

    To enhance type checking, a new struct is introduced to represent the
    pointer plus lock-bit
    that is stored in the bucket-table. This is "struct rhash_lock_head"
    and is empty. A pointer to this needs to be cast to either an
    unsigned lock, or a "struct rhash_head *" to be useful.
    Variables of this type are most often called "bkt".

    Previously "pprev" would sometimes point to a bucket, and sometimes a
    ->next pointer in an rhash_head. As these are now different types,
    pprev is NULL when it would have pointed to the bucket. In that case,
    'blk' is used, together with correct locking protocol.

    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     

23 Aug, 2018

9 commits

  • ipc_getref has still a return value of type "int", matching the atomic_t
    interface of atomic_inc_not_zero()/atomic_add_unless().

    ipc_getref now uses refcount_inc_not_zero, which has a return value of
    type "bool".

    Therefore, update the return code to avoid implicit conversions.

    Link: http://lkml.kernel.org/r/20180712185241.4017-13-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The varable names got a mess, thus standardize them again:

    id: user space id. Called semid, shmid, msgid if the type is known.
    Most functions use "id" already.
    idx: "index" for the idr lookup
    Right now, some functions use lid, ipc_addid() already uses idx as
    the variable name.
    seq: sequence number, to avoid quick collisions of the user space id
    key: user space key, used for the rhash tree

    Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Now that we know that rhashtable_init() will not fail, we can get rid of a
    lot of the unnecessary cleanup paths when the call errored out.

    [manfred@colorfullife.com: variable name added to util.h to resolve checkpatch warning]
    Link: http://lkml.kernel.org/r/20180712185241.4017-11-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • In sysvipc we have an ids->tables_initialized regarding the rhashtable,
    introduced in 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots
    of keys")

    It's there, specifically, to prevent nil pointer dereferences, from using
    an uninitialized api. Considering how rhashtable_init() can fail
    (probably due to ENOMEM, if anything), this made the overall ipc
    initialization capable of failure as well. That alone is ugly, but fine,
    however I've spotted a few issues regarding the semantics of
    tables_initialized (however unlikely they may be):

    - There is inconsistency in what we return to userspace: ipc_addid()
    returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
    returns EINVAL.

    - After we started using rhashtables, ipc_findkey() can return nil upon
    !tables_initialized, but the caller expects nil for when the ipc
    structure isn't found, and can therefore call into ipcget() callbacks.

    Now that rhashtable initialization cannot fail, we can properly get rid of
    the hack altogether.

    [manfred@colorfullife.com: commit id extended to 12 digits]
    Link: http://lkml.kernel.org/r/20180712185241.4017-10-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ipc/util.c contains multiple functions to get the ipc object pointer given
    an id number.

    There are two sets of function: One set verifies the sequence counter part
    of the id number, other functions do not check the sequence counter.

    The standard for function names in ipc/util.c is
    - ..._check() functions verify the sequence counter
    - ..._idr() functions do not verify the sequence counter

    ipc_lock() is an exception: It does not verify the sequence counter value,
    but this is not obvious from the function name.

    Furthermore, shm.c is the only user of this helper. Thus, we can simply
    move the logic into shm_lock() and get rid of the function altogether.

    [manfred@colorfullife.com: most of changelog]
    Link: http://lkml.kernel.org/r/20180712185241.4017-7-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The comment that explains ipc_obtain_object_check is wrong: The function
    checks the sequence number, not the reference counter.

    Note that checking the reference counter would be meaningless: The
    reference counter is decreased without holding any locks, thus an object
    with kern_ipc_perm.deleted=true may disappear at the end of the next rcu
    grace period.

    Link: http://lkml.kernel.org/r/20180712185241.4017-6-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Both the comment and the name of ipcctl_pre_down_nolock() are misleading:
    The function must be called while holdling the rw semaphore.

    Therefore the patch renames the function to ipcctl_obtain_check(): This
    name matches the other names used in util.c:

    - "obtain" function look up a pointer in the idr, without
    acquiring the object lock.
    - The caller is responsible for locking.
    - _check means that the sequence number is checked.

    Link: http://lkml.kernel.org/r/20180712185241.4017-5-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • ipc_addid() is impossible to use:
    - for certain failures, the caller must not use ipc_rcu_putref(),
    because the reference counter is not yet initialized.
    - for other failures, the caller must use ipc_rcu_putref(),
    because parallel operations could be ongoing already.

    The patch cleans that up, by initializing the refcount early, and by
    modifying all callers.

    The issues is related to the finding of
    syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
    issue with reading kern_ipc_perm.seq, here both read and write to already
    released memory could happen.

    Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc()
    (within ipc_idr_alloc()).

    Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
    may see an uninitialized value.

    The patch moves the initialization of kern_ipc_perm.seq before the calls
    of idr_alloc().

    Notes:
    1) This patch has a user space visible side effect:
    If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
    if semget()/msgget()/shmget() fails in the final step of adding the id
    to the rhash tree, then .._next_id is cleared. Before the patch, is
    remained unmodified.

    There is no change of the behavior after a successful ..get() call: It
    always clears .._next_id, there is no impact to non checkpoint/restore
    code as that code does not use .._next_id.

    2) The patch correctly documents that after a call to ipc_idr_alloc(),
    the full tear-down sequence must be used. The callers of ipc_addid()
    do not fullfill that, i.e. more bugfixes are required.

    The patch is a squash of a patch from Dmitry and my own changes.

    Link: http://lkml.kernel.org/r/20180712185241.4017-3-manfred@colorfullife.com
    Reported-by: syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

22 Jun, 2018

1 commit

  • Due to the use of rhashtables in net namespaces,
    rhashtable.h is included in lots of the kernel,
    so a small changes can required a large recompilation.
    This makes development painful.

    This patch splits out rhashtable-types.h which just includes
    the major type declarations, and does not include (non-trivial)
    inline code. rhashtable.h is no longer included by anything
    in the include/ directory.
    Common include files only include rhashtable-types.h so a large
    recompilation is only triggered when that changes.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     

12 Apr, 2018

1 commit

  • Move the proc_mkdir() call within the sysvipc subsystem such that we
    avoid polluting proc_root_init() with petty cpp.

    [dave@stgolabs.net: contributed changelog]
    Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Acked-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

25 Mar, 2018

1 commit

  • Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm
    and /proc/sysvipc/sem are opened, and make it available through
    the new helper ipc_seq_pid_ns.

    This makes it possible to report the pids in these files in the
    pid namespace of the opener of the files.

    Implement ipc_update_pid. A simple impline helper that will only update
    a struct pid pointer if the new value does not equal the old value. This
    removes the need for wordy code sequences like:

    old = object->pid;
    object->pid = new;
    put_pid(old);

    and

    old = object->pid;
    if (old != new) {
    object->pid = new;
    put_pid(old);
    }

    Allowing the following to be written instead:

    ipc_update_pid(&object->pid, new);

    Which is easier to read and ensures that the pid reference count is
    not touched the old and the new values are the same. Not touching
    the reference count in this case is important to help avoid issues
    like af_unix experienced, where multiple threads of the same
    process managed to bounce the struct pid between cpu cache lines,
    but updating the pids reference count.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

07 Feb, 2018

1 commit

  • As described in the title, this patch fixes id_ds inconsistency when
    ctl_stat executes concurrently with some ds-changing function, e.g.
    shmat, msgsnd or whatever.

    For instance, if shmctl(IPC_STAT) is running concurrently
    with shmat, following data structure can be returned:
    {... shm_lpid = 0, shm_nattch = 1, ...}

    Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems
    Signed-off-by: Philippe Mikoyan
    Reviewed-by: Davidlohr Bueso
    Cc: Al Viro
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Philippe Mikoyan
     

18 Nov, 2017

3 commits

  • For a custom microbenchmark on a 3.30GHz Xeon SandyBridge, which calls
    IPC_STAT over and over, it was calculated that, on avg the cost of
    ipc_get_maxid() for increasing amounts of keys was:

    10 keys: ~900 cycles
    100 keys: ~15000 cycles
    1000 keys: ~150000 cycles
    10000 keys: ~2100000 cycles

    This is unsurprising as maxid is currently O(n).

    By having the max_id available in O(1) we save all those cycles for each
    semctl(_STAT) command, the idr_find can be expensive -- which some real
    (customer) workloads actually poll on.

    Note that this used to be the case, until commit 7ca7e564e04 ("ipc:
    store ipcs into IDRs"). The cost is the extra idr_find when doing
    RMIDs, but we simply go backwards, and should not take too many
    iterations to find the new value.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170831172049.14576-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This is better understood as a limit, instead of size; exactly like the
    function comment indicates. Rename it.

    Link: http://lkml.kernel.org/r/20170831172049.14576-4-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "sysvipc: ipc-key management improvements".

    Here are a few improvements I spotted while eyeballing Guillaume's
    rhashtable implementation for ipc keys. The first and fourth patches
    are the interesting ones, the middle two are trivial.

    This patch (of 4):

    The next_id object-allocation functionality was introduced in commit
    03f595668017 ("ipc: add sysctl to specify desired next object id").

    Given that these new entries are _only_ exported under the
    CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case
    to even know about ->next_id. As such rewrite ipc_buildid() such that
    it can do away with the field as well as unnecessary branches when
    adding a new identifier. The end result also better differentiates both
    cases, so the code ends up being cleaner; albeit the small duplications
    regarding the default case.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170831172049.14576-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

2 commits

  • ipc_findkey() used to scan all objects to look for the wanted key. This
    is slow when using a high number of keys. This change adds an rhashtable
    of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).

    This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
    56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]

    Other (more micro) benchmark results, by the author: On an i5 laptop, the
    following loop executed right after a reboot took, without and with this
    change:

    for (int i = 0, k=0x424242; i < KEYS; ++i)
    semget(k++, 1, IPC_CREAT | 0600);

    total total max single max single
    KEYS without with call without call with

    1 3.5 4.9 µs 3.5 4.9
    10 7.6 8.6 µs 3.7 4.7
    32 16.2 15.9 µs 4.3 5.3
    100 72.9 41.8 µs 3.7 4.7
    1000 5,630.0 502.0 µs * *
    10000 1,340,000.0 7,240.0 µs * *
    31900 17,600,000.0 22,200.0 µs * *

    *: unreliable measure: high variance

    The duration for a lookup-only usage was obtained by the same loop once
    the keys are present:

    total total max single max single
    KEYS without with call without call with

    1 2.1 2.5 µs 2.1 2.5
    10 4.5 4.8 µs 2.2 2.3
    32 13.0 10.8 µs 2.3 2.8
    100 82.9 25.1 µs * 2.3
    1000 5,780.0 217.0 µs * *
    10000 1,470,000.0 2,520.0 µs * *
    31900 17,400,000.0 7,810.0 µs * *

    Finally, executing each semget() in a new process gave, when still
    summing only the durations of these syscalls:

    creation:
    total total
    KEYS without with

    1 3.7 5.0 µs
    10 32.9 36.7 µs
    32 125.0 109.0 µs
    100 523.0 353.0 µs
    1000 20,300.0 3,280.0 µs
    10000 2,470,000.0 46,700.0 µs
    31900 27,800,000.0 219,000.0 µs

    lookup-only:
    total total
    KEYS without with

    1 2.5 2.7 µs
    10 25.4 24.4 µs
    32 106.0 72.6 µs
    100 591.0 352.0 µs
    1000 22,400.0 2,250.0 µs
    10000 2,510,000.0 25,700.0 µs
    31900 28,200,000.0 115,000.0 µs

    [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

    Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
    Signed-off-by: Guillaume Knispel
    Reviewed-by: Marc Pardo
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: Manfred Spraul
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: "Peter Zijlstra (Intel)"
    Cc: Ingo Molnar
    Cc: Sebastian Andrzej Siewior
    Cc: Serge Hallyn
    Cc: Andrey Vagin
    Cc: Guillaume Knispel
    Cc: Marc Pardo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guillaume Knispel
     
  • refcount_t type and corresponding API should be used instead of atomic_t
    when the variable is used as a reference counter. This allows to avoid
    accidental refcounter overflows that might lead to use-after-free
    situations.

    Link: http://lkml.kernel.org/r/1499417992-3238-4-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Cc: Peter Zijlstra
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Alexey Dobriyan
    Cc: Serge Hallyn
    Cc:
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Elena Reshetova
     

13 Jul, 2017

5 commits

  • Only after ipc_addid() has succeeded will refcounting be used, so move
    initialization into ipc_addid() and remove from open-coded *_alloc()
    routines.

    Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • No callers remain for ipc_rcu_alloc(). Drop the function.

    [manfred@colorfullife.com: Rediff because the memset was temporarily inside ipc_rcu_free()]
    Link: http://lkml.kernel.org/r/20170525185107.12869-13-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • There are no more callers of ipc_rcu_free(), so remove it.

    Link: http://lkml.kernel.org/r/20170525185107.12869-9-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The only users of ipc_alloc() were ipc_rcu_alloc() and the on-heap
    sem_io fall-back memory. Better to just open-code these to make things
    easier to read.

    [manfred@colorfullife.com: Rediff due to inclusion of memset() into ipc_rcu_alloc()]
    Link: http://lkml.kernel.org/r/20170525185107.12869-5-manfred@colorfullife.com
    Signed-off-by: Kees Cook
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • ipc has two management structures that exist for every id:
    - struct kern_ipc_perm, it contains e.g. the permissions.
    - struct ipc_rcu, it contains the rcu head for rcu handling and the
    refcount.

    The patch merges both structures.

    As a bonus, we may save one cacheline, because both structures are
    cacheline aligned. In addition, it reduces the number of casts, instead
    most codepaths can use container_of.

    To simplify code, the ipc_rcu_alloc initializes the allocation to 0.

    [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()]
    Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com
    Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

09 May, 2017

1 commit

  • Patch series "kvmalloc", v5.

    There are many open coded kmalloc with vmalloc fallback instances in the
    tree. Most of them are not careful enough or simply do not care about
    the underlying semantic of the kmalloc/page allocator which means that
    a) some vmalloc fallbacks are basically unreachable because the kmalloc
    part will keep retrying until it succeeds b) the page allocator can
    invoke a really disruptive steps like the OOM killer to move forward
    which doesn't sound appropriate when we consider that the vmalloc
    fallback is available.

    As it can be seen implementing kvmalloc requires quite an intimate
    knowledge if the page allocator and the memory reclaim internals which
    strongly suggests that a helper should be implemented in the memory
    subsystem proper.

    Most callers, I could find, have been converted to use the helper
    instead. This is patch 6. There are some more relying on __GFP_REPEAT
    in the networking stack which I have converted as well and Eric Dumazet
    was not opposed [2] to convert them as well.

    [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

    This patch (of 9):

    Using kmalloc with the vmalloc fallback for larger allocations is a
    common pattern in the kernel code. Yet we do not have any common helper
    for that and so users have invented their own helpers. Some of them are
    really creative when doing so. Let's just add kv[mz]alloc and make sure
    it is implemented properly. This implementation makes sure to not make
    a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
    to not warn about allocation failures. This also rules out the OOM
    killer as the vmalloc is a more approapriate fallback than a disruptive
    user visible action.

    This patch also changes some existing users and removes helpers which
    are specific for them. In some cases this is not possible (e.g.
    ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
    require GFP_NO{FS,IO} context which is not vmalloc compatible in general
    (note that the page table allocation is GFP_KERNEL). Those need to be
    fixed separately.

    While we are at it, document that __vmalloc{_node} about unsupported gfp
    mask because there seems to be a lot of confusion out there.
    kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
    superset) flags to catch new abusers. Existing ones would have to die
    slowly.

    [sfr@canb.auug.org.au: f2fs fixup]
    Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Reviewed-by: Andreas Dilger [ext4 part]
    Acked-by: Vlastimil Babka
    Cc: John Hubbard
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Apr, 2017

1 commit

  • ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
    ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/filemap.c:1283: ERROR: Unexpected indentation.
    ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
    ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
    ./ipc/util.c:676: ERROR: Unexpected indentation.
    ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./security/security.c:109: ERROR: Unexpected indentation.
    ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
    ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
    ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./ipc/util.c:477: ERROR: Unknown target name: "s".

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Bjorn Helgaas
    Signed-off-by: Jonathan Corbet

    mchehab@s-opensource.com
     

23 Jan, 2016

1 commit

  • There are many locations that do

    if (memory_was_allocated_by_vmalloc)
    vfree(ptr);
    else
    kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you found
    problems.

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

01 Oct, 2015

1 commit

  • As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
    having initialized the IPC object state. Yes, we initialize the IPC
    object in a locked state, but with all the lockless RCU lookup work,
    that IPC object lock no longer means that the state cannot be seen.

    We already did this for the IPC semaphore code (see commit e8577d1f0329:
    "ipc/sem.c: fully initialize sem_array before making it visible") but we
    clearly forgot about msg and shm.

    Reported-by: Dmitry Vyukov
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Jul, 2015

3 commits

  • In ipc_obtain_object_check we return -EIDRM when a bogus sequence number
    is detected via ipc_checkid, while the ipc manpages state the following
    return codes for such errors:

    EIDRM points to a removed identifier.
    EINVAL Invalid value, or unaligned, etc.

    EIDRM should only be returned upon a RMID call (->deleted check), and thus
    return EINVAL for wrong seq. This difference in semantics has also caused
    real bugs, ie: https://bugzilla.redhat.com/show_bug.cgi?id=246509

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ipc_lock helper is used by all forms of sysv ipc to acquire the ipc
    object's spinlock. Upon error (bogus identifier), we always return
    -EINVAL, whether the problem be in the idr path or because we raced with a
    task performing RMID. For the later, however, all ipc related manpages,
    state the that for:

    EIDRM points to a removed identifier.

    And return:

    EINVAL Invalid value, or unaligned, etc.

    Which (EINVAL) should only return once the ipc resource is deleted. For
    all types of ipc this is done immediately upon a RMID command. However,
    shared memory behaves slightly different as it can merely mark a segment
    for deletion, and delay the actual freeing until there are no more active
    consumers. Per shmctl(IPC_RMID) manpage:

    ""
    Mark the segment to be destroyed. The segment will only actually
    be destroyed after the last process detaches it (i.e., when the
    shm_nattch member of the associated structure shmid_ds is zero).
    ""

    Unlike ipc_lock, paths that behave "correctly", at least per the manpage,
    involve controlling the ipc resource via *ctl(), doing the exact same
    validity check as ipc_lock after right acquiring the spinlock:

    if (!ipc_valid_object()) {
    err = -EIDRM;
    goto out_unlock;
    }

    Thus make ipc_lock consistent with the rest of ipc code and return -EIDRM
    in ipc_lock when !ipc_valid_object().

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... to ipc_obtain_object_idr, which is more meaningful and makes the code
    slightly easier to follow.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso