06 Oct, 2018

1 commit

  • This uses ERR_CAST() instead of an open-coded cast, as it is casting
    across structure pointers, which upsets __randomize_layout:

    ipc/shm.c: In function `shm_lock':
    ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm'

    return (void *)ipcp;
    ^~~~~~~~~~~~
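
    For illustration only (not part of the patch), a userspace sketch of
    the error-pointer idiom involved. The helpers below are simplified
    stand-ins modelled on the kernel's ERR_PTR()/IS_ERR()/ERR_CAST();
    lookup_sketch(), shm_lock_sketch() and demo_segment are hypothetical
    names, and the structures are reduced to what the example needs.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_ERRNO 4095

    /* Userspace stand-ins modelled on the kernel helpers of the same names. */
    static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
    static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
    static inline int IS_ERR(const void *ptr)
    {
        /* Error codes live in the last MAX_ERRNO values of the address space. */
        return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
    }
    /* Forward an encoded error without naming either structure type. */
    static inline void *ERR_CAST(const void *ptr) { return (void *)(uintptr_t)ptr; }

    struct kern_ipc_perm { int id; };
    struct shmid_kernel { struct kern_ipc_perm shm_perm; int payload; };

    static struct shmid_kernel demo_segment; /* hypothetical single segment */

    /* Hypothetical lookup: negative ids fail with -EINVAL. */
    static struct kern_ipc_perm *lookup_sketch(int id)
    {
        if (id < 0)
            return ERR_PTR(-EINVAL);
        demo_segment.shm_perm.id = id;
        return &demo_segment.shm_perm;
    }

    static struct shmid_kernel *shm_lock_sketch(int id)
    {
        struct kern_ipc_perm *ipcp = lookup_sketch(id);

        if (IS_ERR(ipcp))
            return ERR_CAST(ipcp); /* rather than: return (void *)ipcp; */

        /* container_of(), spelled out with offsetof(). */
        return (struct shmid_kernel *)((char *)ipcp -
                offsetof(struct shmid_kernel, shm_perm));
    }
    ```

    The error value travels through a void pointer helper, so no direct
    cast between the two structure pointer types appears in the caller.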

    Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast
    Fixes: 82061c57ce93 ("ipc: drop ipc_lock()")
    Signed-off-by: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

05 Sep, 2018

1 commit

  • When getting rid of the general ipc_lock(), returning -EIDRM for an
    already removed object was missed, which also made the comment around
    the ipc object validity check bogus. Under EIDRM conditions, callers
    will in turn not see the error and continue with the operation.
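
    A minimal userspace sketch of the intended behaviour, assuming an
    object that carries a "deleted" flag. The names ipc_obj_sketch and
    shm_lock_status_sketch are hypothetical and the locking itself is left
    out; only the error reporting is modelled.

    ```c
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct ipc_obj_sketch { bool deleted; };

    /*
     * Returns 0 when the object may be used, -EINVAL when the lookup found
     * nothing, and -EIDRM when the object was removed in the meantime.
     */
    static int shm_lock_status_sketch(const struct ipc_obj_sketch *obj)
    {
        if (obj == NULL)
            return -EINVAL;
        if (obj->deleted)
            return -EIDRM; /* the case that must not be silently dropped */
        return 0;
    }
    ```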

    Link: http://lkml.kernel.org/r/20180824030920.GD3677@linux-r8p5
    Link: http://lkml.kernel.org/r/20180823024051.GC13343@shao2-debian
    Fixes: 82061c57ce93 ("ipc: drop ipc_lock()")
    Signed-off-by: Davidlohr Bueso
    Reported-by: kernel test robot
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

23 Aug, 2018

10 commits

  • ipc_getref still has a return value of type "int", matching the atomic_t
    interface of atomic_inc_not_zero()/atomic_add_unless().

    ipc_getref now uses refcount_inc_not_zero, which has a return value of
    type "bool".

    Therefore, update the return code to avoid implicit conversions.
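
    As an illustration, a userspace sketch of an "increment unless zero"
    helper with a bool result, written with C11 atomics. The names
    ipc_perm_sketch and ipc_getref_sketch are hypothetical; the kernel's
    refcount_t additionally guards against overflow, which is not modelled
    here.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>

    struct ipc_perm_sketch { atomic_int refcount; };

    /* Take a reference only if the object is still alive (count != 0). */
    static inline bool ipc_getref_sketch(struct ipc_perm_sketch *p)
    {
        int old = atomic_load(&p->refcount);

        while (old != 0) {
            /* On failure, 'old' is reloaded with the current value. */
            if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
                return true;
        }
        return false;
    }
    ```

    Returning bool makes "got a reference or not" explicit at the call
    site, with no int conversion in between.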

    Link: http://lkml.kernel.org/r/20180712185241.4017-13-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The variable names had become a mess, so standardize them again:

    id: user space id. Called semid, shmid, msgid if the type is known.
        Most functions use "id" already.
    idx: "index" for the idr lookup.
        Right now, some functions use lid; ipc_addid() already uses idx as
        the variable name.
    seq: sequence number, to avoid quick collisions of the user space id.
    key: user space key, used for the rhash tree.
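
    To make the relationship between id, idx and seq concrete, a small
    sketch. It assumes the user space id is composed as
    seq * multiplier + idx with a multiplier of 32768 (the IPCMNI limit in
    kernels of this period); the helper names are hypothetical.

    ```c
    /* Assumed multiplier: 32768, the IPCMNI limit in kernels of this period. */
    #define SEQ_MULTIPLIER_SKETCH 32768

    /* Compose the user space id from the idr index and the sequence number. */
    static inline int build_id_sketch(int idx, int seq)
    {
        return SEQ_MULTIPLIER_SKETCH * seq + idx;
    }

    /* Recover the idr index from a user space id. */
    static inline int id_to_idx_sketch(int id)
    {
        return id % SEQ_MULTIPLIER_SKETCH;
    }

    /* Recover the sequence number from a user space id. */
    static inline int id_to_seq_sketch(int id)
    {
        return id / SEQ_MULTIPLIER_SKETCH;
    }
    ```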

    Link: http://lkml.kernel.org/r/20180712185241.4017-12-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Now that we know that rhashtable_init() will not fail, we can get rid of a
    lot of the unnecessary cleanup paths when the call errored out.

    [manfred@colorfullife.com: variable name added to util.h to resolve checkpatch warning]
    Link: http://lkml.kernel.org/r/20180712185241.4017-11-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • In sysvipc we have an ids->tables_initialized regarding the rhashtable,
    introduced in 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots
    of keys")

    It's there, specifically, to prevent nil pointer dereferences, from using
    an uninitialized api. Considering how rhashtable_init() can fail
    (probably due to ENOMEM, if anything), this made the overall ipc
    initialization capable of failure as well. That alone is ugly, but fine,
    however I've spotted a few issues regarding the semantics of
    tables_initialized (however unlikely they may be):

    - There is inconsistency in what we return to userspace: ipc_addid()
    returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
    returns EINVAL.

    - After we started using rhashtables, ipc_findkey() can return nil upon
    !tables_initialized, but the caller expects nil for when the ipc
    structure isn't found, and can therefore call into ipcget() callbacks.

    Now that rhashtable initialization cannot fail, we can properly get rid of
    the hack altogether.

    [manfred@colorfullife.com: commit id extended to 12 digits]
    Link: http://lkml.kernel.org/r/20180712185241.4017-10-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ipc/util.c contains multiple functions to get the ipc object pointer given
    an id number.

    There are two sets of functions: one set verifies the sequence counter
    part of the id number, the other set does not check the sequence counter.

    The standard for function names in ipc/util.c is
    - ..._check() functions verify the sequence counter
    - ..._idr() functions do not verify the sequence counter

    ipc_lock() is an exception: It does not verify the sequence counter value,
    but this is not obvious from the function name.

    Furthermore, shm.c is the only user of this helper. Thus, we can simply
    move the logic into shm_lock() and get rid of the function altogether.
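
    The difference between the two flavours can be sketched in userspace
    as follows. The slot array, the multiplier of 32768 and all names
    ending in _sketch are assumptions made for the example; the real
    functions work on an idr and return error pointers.

    ```c
    #include <stddef.h>

    #define SEQ_MULTIPLIER_SKETCH 32768
    #define NSLOTS_SKETCH 4

    struct perm_sketch { int seq; int in_use; };
    static struct perm_sketch slots_sketch[NSLOTS_SKETCH];

    /* "_idr" flavour: index lookup only, the sequence part of id is ignored. */
    static struct perm_sketch *obtain_idr_sketch(int id)
    {
        int idx = id % SEQ_MULTIPLIER_SKETCH;

        if (id < 0 || idx >= NSLOTS_SKETCH || !slots_sketch[idx].in_use)
            return NULL;
        return &slots_sketch[idx];
    }

    /* "_check" flavour: same lookup, then reject a stale sequence number. */
    static struct perm_sketch *obtain_check_sketch(int id)
    {
        struct perm_sketch *p = obtain_idr_sketch(id);

        if (p && p->seq != id / SEQ_MULTIPLIER_SKETCH)
            return NULL;
        return p;
    }
    ```

    An id whose index matches a live slot but whose sequence part is
    outdated is accepted by the first helper and rejected by the second.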

    [manfred@colorfullife.com: most of changelog]
    Link: http://lkml.kernel.org/r/20180712185241.4017-7-manfred@colorfullife.com
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The comment that explains ipc_obtain_object_check is wrong: The function
    checks the sequence number, not the reference counter.

    Note that checking the reference counter would be meaningless: The
    reference counter is decreased without holding any locks, thus an object
    with kern_ipc_perm.deleted=true may disappear at the end of the next rcu
    grace period.

    Link: http://lkml.kernel.org/r/20180712185241.4017-6-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Both the comment and the name of ipcctl_pre_down_nolock() are misleading:
    The function must be called while holding the rw semaphore.

    Therefore the patch renames the function to ipcctl_obtain_check(): This
    name matches the other names used in util.c:

    - "obtain" functions look up a pointer in the idr, without
    acquiring the object lock.
    - The caller is responsible for locking.
    - _check means that the sequence number is checked.

    Link: http://lkml.kernel.org/r/20180712185241.4017-5-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Herbert Xu
    Cc: Kees Cook
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • ipc_addid() is impossible to use:
    - for certain failures, the caller must not use ipc_rcu_putref(),
    because the reference counter is not yet initialized.
    - for other failures, the caller must use ipc_rcu_putref(),
    because parallel operations could be ongoing already.

    The patch cleans that up, by initializing the refcount early, and by
    modifying all callers.

    The issue is related to the finding of
    syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
    issue with reading kern_ipc_perm.seq, here both read and write to already
    released memory could happen.

    Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc()
    (within ipc_idr_alloc()).

    Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
    may see an uninitialized value.

    The patch moves the initialization of kern_ipc_perm.seq before the calls
    of idr_alloc().

    Notes:
    1) This patch has a user space visible side effect:
    If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
    if semget()/msgget()/shmget() fails in the final step of adding the id
    to the rhash tree, then .._next_id is cleared. Before the patch, it
    remained unmodified.

    There is no change of the behavior after a successful ..get() call: It
    always clears .._next_id, there is no impact to non checkpoint/restore
    code as that code does not use .._next_id.

    2) The patch correctly documents that after a call to ipc_idr_alloc(),
    the full tear-down sequence must be used. The callers of ipc_addid()
    do not fulfill that, i.e. more bugfixes are required.

    The patch is a squash of a patch from Dmitry and my own changes.
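
    The ordering rule behind the fix (initialize first, make reachable
    second) can be sketched in userspace with C11 atomics. This is only an
    analogy: the single published pointer stands in for the idr slot, and
    perm_pub_sketch, add_sketch() and lookup_seq_sketch() are hypothetical
    names.

    ```c
    #include <stdatomic.h>
    #include <stddef.h>

    struct perm_pub_sketch { int seq; };

    /* Stand-in for the idr slot through which readers find the object. */
    static _Atomic(struct perm_pub_sketch *) published_sketch;

    static void add_sketch(struct perm_pub_sketch *p, int seq)
    {
        p->seq = seq; /* 1. fully initialize the object ... */
        /* 2. ... then publish it; release orders the store above first. */
        atomic_store_explicit(&published_sketch, p, memory_order_release);
    }

    /* Reader side: returns the sequence number, or -1 if nothing is published. */
    static int lookup_seq_sketch(void)
    {
        struct perm_pub_sketch *p =
            atomic_load_explicit(&published_sketch, memory_order_acquire);

        return p ? p->seq : -1;
    }
    ```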

    Link: http://lkml.kernel.org/r/20180712185241.4017-3-manfred@colorfullife.com
    Reported-by: syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com
    Signed-off-by: Manfred Spraul
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Michael Kerrisk
    Cc: Herbert Xu
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • ipc_addid() initializes kern_ipc_perm.id after having called
    ipc_idr_alloc().

    Thus a parallel semctl() or msgctl() that uses e.g. MSG_STAT may use this
    uninitialized value as the return code.

    The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

    The issue is related to the finding of
    syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
    issue with kern_ipc_perm.seq

    Link: http://lkml.kernel.org/r/20180712185241.4017-2-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Cc: Dmitry Vyukov
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Herbert Xu
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

16 Aug, 2018

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    - Gustavo A. R. Silva keeps working on the implicit switch fallthru
    changes.

    - Support 802.11ax High-Efficiency wireless in cfg80211 et al, from
    Luca Coelho.

    - Re-enable ASPM in r8169, from Kai-Heng Feng.

    - Add virtual XFRM interfaces, which avoids all of the limitations of
    existing IPSEC tunnels. From Steffen Klassert.

    - Convert GRO over to use a hash table, so that when we have many
    flows active we don't traverse a long list during accumulation.

    - Many new self tests for routing, TC, tunnels, etc. Too many
    contributors to mention them all, but I'm really happy to keep
    seeing this stuff.

    - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.

    - Lots of cleanups and fixes in L2TP code from Guillaume Nault.

    - Add IPSEC offload support to netdevsim, from Shannon Nelson.

    - Add support for slotting with non-uniform distribution to netem
    packet scheduler, from Yousuk Seung.

    - Add UDP GSO support to mlx5e, from Boris Pismenny.

    - Support offloading of Team LAG in NFP, from John Hurley.

    - Allow to configure TX queue selection based upon RX queue, from
    Amritha Nambiar.

    - Support ethtool ring size configuration in aquantia, from Anton
    Mikaev.

    - Support DSCP and flowlabel per-transport in SCTP, from Xin Long.

    - Support list based batching and stack traversal of SKBs, this is
    very exciting work. From Edward Cree.

    - Busyloop optimizations in vhost_net, from Toshiaki Makita.

    - Introduce the ETF qdisc, which allows time based transmissions. IGB
    can offload this in hardware. From Vinicius Costa Gomes.

    - Add parameter support to devlink, from Moshe Shemesh.

    - Several multiplication and division optimizations for BPF JIT in
    nfp driver, from Jiong Wang.

    - Lots of preparatory work to make more of the packet scheduler layer
    lockless, when possible, from Vlad Buslov.

    - Add ACK filter and NAT awareness to sch_cake packet scheduler, from
    Toke Høiland-Jørgensen.

    - Support regions and region snapshots in devlink, from Alex Vesker.

    - Allow to attach XDP programs to both HW and SW at the same time on
    a given device, with initial support in nfp. From Jakub Kicinski.

    - Add TLS RX offload and support in mlx5, from Ilya Lesokhin.

    - Use PHYLIB in r8169 driver, from Heiner Kallweit.

    - All sorts of changes to support Spectrum 2 in mlxsw driver, from
    Ido Schimmel.

    - PTP support in mv88e6xxx DSA driver, from Andrew Lunn.

    - Make TCP_USER_TIMEOUT socket option more accurate, from Jon
    Maxwell.

    - Support for templates in packet scheduler classifier, from Jiri
    Pirko.

    - IPV6 support in RDS, from Ka-Cheong Poon.

    - Native tproxy support in nf_tables, from Máté Eckl.

    - Maintain IP fragment queue in an rbtree, but optimize properly for
    in-order frags. From Peter Oskolkov.

    - Improve handling of ACKs on hole repairs, from Yuchung Cheng"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
    bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
    hv/netvsc: Fix NULL dereference at single queue mode fallback
    net: filter: mark expected switch fall-through
    xen-netfront: fix warn message as irq device name has '/'
    cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
    net: dsa: mv88e6xxx: missing unlock on error path
    rds: fix building with IPV6=m
    inet/connection_sock: prefer _THIS_IP_ to current_text_addr
    net: dsa: mv88e6xxx: bitwise vs logical bug
    net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
    ieee802154: hwsim: using right kind of iteration
    net: hns3: Add vlan filter setting by ethtool command -K
    net: hns3: Set tx ring' tc info when netdev is up
    net: hns3: Remove tx ring BD len register in hns3_enet
    net: hns3: Fix desc num set to default when setting channel
    net: hns3: Fix for phy link issue when using marvell phy driver
    net: hns3: Fix for information of phydev lost problem when down/up
    net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
    net: hns3: Add support for serdes loopback selftest
    bnxt_en: take coredump_record structure off stack
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull vfs open-related updates from Al Viro:

    - "do we need fput() or put_filp()" rules are gone - it's always fput()
    now. We keep track of that state where it belongs - in ->f_mode.

    - int *opened mess killed - in finish_open(), in ->atomic_open()
    instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

    - alloc_file() wrappers with saner calling conventions are introduced
    (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
    much simplification.

    - while we are at it, saner calling conventions for path_init() and
    link_path_walk(), simplifying things inside fs/namei.c (both on
    open-related paths and elsewhere).

    * 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    few more cleanups of link_path_walk() callers
    allow link_path_walk() to take ERR_PTR()
    make path_init() unconditionally paired with terminate_walk()
    document alloc_file() changes
    make alloc_file() static
    do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
    new helper: alloc_file_clone()
    create_pipe_files(): switch the first allocation to alloc_file_pseudo()
    anon_inode_getfile(): switch to alloc_file_pseudo()
    hugetlb_file_setup(): switch to alloc_file_pseudo()
    ocxlflash_getfile(): switch to alloc_file_pseudo()
    cxl_getfile(): switch to alloc_file_pseudo()
    ... and switch shmem_file_setup() to alloc_file_pseudo()
    __shmem_file_setup(): reorder allocations
    new wrapper: alloc_file_pseudo()
    kill FILE_{CREATED,OPENED}
    switch atomic_open() and lookup_open() to returning 0 in all success cases
    document ->atomic_open() changes
    ->atomic_open(): return 0 in all success cases
    get rid of 'opened' in path_openat() and the helpers downstream
    ...

    Linus Torvalds
     

06 Aug, 2018

1 commit


03 Aug, 2018

2 commits

  • Commit 05ea88608d4e ("mm, hugetlbfs: introduce ->pagesize() to
    vm_operations_struct") adds a new ->pagesize() function to
    hugetlb_vm_ops, intended to cover all hugetlbfs backed files.

    With the System V shared memory model, if "huge page" is specified, the
    "shared memory" is backed by hugetlbfs files, but the mappings initiated
    via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
    so we need to add a ->pagesize function to shm_vm_ops. Otherwise,
    vma_kernel_pagesize() returns PAGE_SIZE given a hugetlbfs backed vma,
    which triggers the BUG below:

    fs/hugetlbfs/inode.c
    443 if (unlikely(page_mapped(page))) {
    444 BUG_ON(truncate_op);

    resulting in

    hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
    ------------[ cut here ]------------
    kernel BUG at fs/hugetlbfs/inode.c:444!
    Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
    CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
    RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
    ....
    Call Trace:
    hugetlbfs_evict_inode+0x1e/0x3e
    evict+0xdb/0x1af
    iput+0x1a2/0x1f7
    dentry_unlink_inode+0xc6/0xf0
    __dentry_kill+0xd8/0x18d
    dput+0x1b5/0x1ed
    __fput+0x18b/0x216
    ____fput+0xe/0x10
    task_work_run+0x90/0xa7
    exit_to_usermode_loop+0xdd/0x116
    do_syscall_64+0x187/0x1ae
    entry_SYSCALL_64_after_hwframe+0x150/0x0
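
    The forwarding that the description above calls for can be sketched in
    userspace like this. All types and names are simplified stand-ins
    (vma_sketch, vm_ops_sketch, shm_pagesize_sketch), and the 4 KiB and
    2 MiB sizes are example values only.

    ```c
    #include <stddef.h>

    #define PAGE_SIZE_SKETCH 4096UL

    struct vma_sketch;
    struct vm_ops_sketch {
        unsigned long (*pagesize)(struct vma_sketch *vma);
    };
    /* The wrapper remembers the vm_ops of the file that backs the mapping. */
    struct vma_sketch { const struct vm_ops_sketch *backing_ops; };

    static unsigned long huge_pagesize_sketch(struct vma_sketch *vma)
    {
        (void)vma;
        return 2UL * 1024 * 1024; /* example huge page size */
    }

    static const struct vm_ops_sketch hugetlb_ops_sketch = {
        .pagesize = huge_pagesize_sketch,
    };
    static const struct vm_ops_sketch plain_ops_sketch = {
        .pagesize = NULL,
    };

    /* Forward to the backing ops when they know better, else the default. */
    static unsigned long shm_pagesize_sketch(struct vma_sketch *vma)
    {
        if (vma->backing_ops->pagesize)
            return vma->backing_ops->pagesize(vma);
        return PAGE_SIZE_SKETCH;
    }
    ```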

    [jane.chu@oracle.com: relocate comment]
    Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
    Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
    Fixes: 05ea88608d4e13 ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
    Signed-off-by: Jane Chu
    Suggested-by: Mike Kravetz
    Reviewed-by: Mike Kravetz
    Acked-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jane Chu
     
  • The BTF conflicts were simple overlapping changes.

    The virtio_net conflict was an overlap of a fix of statistics counter,
    happening alongside a move over to a bona fide statistics structure
    rather than counting values on the stack.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Jul, 2018

1 commit

  • In order for load/store tearing prevention to work, _all_ accesses to
    the variable in question need to be done through the READ_ONCE() and
    WRITE_ONCE() macros. Ensure everyone does so for the q->status variable
    in semtimedop().
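
    For readers unfamiliar with the macros, a simplified userspace
    approximation. It relies on the GNU __typeof__ extension and only
    covers scalar types; the _SKETCH macros, sem_queue_sketch and the two
    helpers are hypothetical names, not the kernel definitions.

    ```c
    #include <errno.h>

    /* Simplified stand-ins; they rely on the GNU __typeof__ extension. */
    #define READ_ONCE_SKETCH(x)       (*(const volatile __typeof__(x) *)&(x))
    #define WRITE_ONCE_SKETCH(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

    struct sem_queue_sketch { int status; };

    /* Writer side: store the final status with a single volatile access. */
    static void wake_sketch(struct sem_queue_sketch *q, int error)
    {
        WRITE_ONCE_SKETCH(q->status, error);
    }

    /* Reader side: load the status with a single volatile access. */
    static int poll_status_sketch(struct sem_queue_sketch *q)
    {
        return READ_ONCE_SKETCH(q->status);
    }
    ```

    The point of the commit is that mixing such marked accesses with plain
    ones on the same variable defeats the purpose.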

    Link: http://lkml.kernel.org/r/20180717052654.676-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Jul, 2018

2 commits


22 Jun, 2018

1 commit

  • Due to the use of rhashtables in net namespaces,
    rhashtable.h is included in lots of the kernel,
    so a small change can require a large recompilation.
    This makes development painful.

    This patch splits out rhashtable-types.h which just includes
    the major type declarations, and does not include (non-trivial)
    inline code. rhashtable.h is no longer included by anything
    in the include/ directory.
    Common include files only include rhashtable-types.h so a large
    recompilation is only triggered when that changes.

    Acked-by: Herbert Xu
    Signed-off-by: NeilBrown
    Signed-off-by: David S. Miller

    NeilBrown
     

15 Jun, 2018

2 commits

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

    Link: http://lkml.kernel.org/r/20180425043413.GA21467@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Acked-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • Both smatch and coverity are reporting potential issues with spectre
    variant 1 with the 'semnum' index within the sma->sems array, ie:

    ipc/sem.c:388 sem_lock() warn: potential spectre issue 'sma->sems'
    ipc/sem.c:641 perform_atomic_semop_slow() warn: potential spectre issue 'sma->sems'
    ipc/sem.c:721 perform_atomic_semop() warn: potential spectre issue 'sma->sems'

    Avoid any possible speculation by using array_index_nospec() thus
    ensuring the semnum value is bounded to [0, sma->sem_nsems). With the
    exception of sem_lock() all of these are slowpaths.
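
    A userspace sketch of the branch-free clamping idea, modelled on the
    generic mask computation. It assumes that right-shifting a negative
    long is an arithmetic shift (true for gcc and clang) and that size
    fits in a long; the function names are hypothetical, and architectures
    may use dedicated instructions instead.

    ```c
    #include <limits.h>

    /* All ones when index < size, zero otherwise, computed without a branch. */
    static inline unsigned long index_mask_nospec_sketch(unsigned long index,
                                                         unsigned long size)
    {
        return ~(long)(index | (size - 1UL - index)) >>
               (sizeof(long) * CHAR_BIT - 1);
    }

    /* In-range indexes pass through unchanged, out-of-range ones become 0. */
    static inline unsigned long index_nospec_sketch(unsigned long index,
                                                    unsigned long size)
    {
        return index & index_mask_nospec_sketch(index, size);
    }
    ```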

    Link: http://lkml.kernel.org/r/20180423171131.njs4rfm2yzyeg6do@linux-n805
    Signed-off-by: Davidlohr Bueso
    Reported-by: Dan Carpenter
    Cc: Peter Zijlstra
    Cc: "Gustavo A. R. Silva"
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

13 Jun, 2018

1 commit

  • The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
    patch replaces cases of:

    kvmalloc(a * b, gfp)

    with:
    kvmalloc_array(a, b, gfp)

    as well as handling cases of:

    kvmalloc(a * b * c, gfp)

    with:

    kvmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kvmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kvmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
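
    The motivation is overflow safety: a product computed inline can wrap
    around silently, while the array forms check it. A userspace sketch of
    both ideas, using malloc() in place of the kernel allocator; the names
    alloc_array_sketch and array3_size_sketch are hypothetical, and the
    saturation to SIZE_MAX mirrors how the size helpers are understood to
    behave.

    ```c
    #include <stdint.h>
    #include <stdlib.h>

    /* Two-factor form: refuse the allocation if n * size would overflow. */
    static void *alloc_array_sketch(size_t n, size_t size)
    {
        if (size != 0 && n > SIZE_MAX / size)
            return NULL;
        return malloc(n * size);
    }

    /* Three-factor size: saturate to SIZE_MAX instead of wrapping around. */
    static size_t array3_size_sketch(size_t a, size_t b, size_t c)
    {
        size_t ab;

        if (b != 0 && a > SIZE_MAX / b)
            return SIZE_MAX;
        ab = a * b;
        if (c != 0 && ab > SIZE_MAX / c)
            return SIZE_MAX;
        return ab * c;
    }
    ```

    A saturated size is expected to make the allocation itself fail,
    which is the safe outcome.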

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kvmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kvmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kvmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kvmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kvmalloc
    + kvmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kvmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kvmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kvmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kvmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kvmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kvmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kvmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kvmalloc(C1 * C2 * C3, ...)
    |
    kvmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kvmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kvmalloc(sizeof(THING) * C2, ...)
    |
    kvmalloc(sizeof(TYPE) * C2, ...)
    |
    kvmalloc(C1 * C2 * C3, ...)
    |
    kvmalloc(C1 * C2, ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kvmalloc
    + kvmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

05 Jun, 2018

1 commit

  • Pull time/Y2038 updates from Thomas Gleixner:

    - Consolidate SysV IPC UAPI headers

    - Convert SysV IPC to the new COMPAT_32BIT_TIME mechanism

    - Cleanup the core interfaces and standardize on the ktime_get_* naming
    convention.

    - Convert the X86 platform ops to timespec64

    - Remove the ugly temporary timespec64 hack

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    x86: Convert x86_platform_ops to timespec64
    timekeeping: Add more coarse clocktai/boottime interfaces
    timekeeping: Add ktime_get_coarse_with_offset
    timekeeping: Standardize on ktime_get_*() naming
    timekeeping: Clean up ktime_get_real_ts64
    timekeeping: Remove timespec64 hack
    y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedop
    y2038: ipc: Enable COMPAT_32BIT_TIME
    y2038: ipc: Use __kernel_timespec
    y2038: ipc: Report long times to user space
    y2038: ipc: Use ktime_get_real_seconds consistently
    y2038: xtensa: Extend sysvipc data structures
    y2038: powerpc: Extend sysvipc data structures
    y2038: sparc: Extend sysvipc data structures
    y2038: parisc: Extend sysvipc data structures
    y2038: mips: Extend sysvipc data structures
    y2038: arm64: Extend sysvipc compat data structures
    y2038: s390: Remove unneeded ipc uapi header files
    y2038: ia64: Remove unneeded ipc uapi header files
    y2038: alpha: Remove unneeded ipc uapi header files
    ...

    Linus Torvalds
     

26 May, 2018

2 commits

  • shmat()'s SHM_REMAP option forbids passing a nil address; this is in
    fact the very first thing we check for. Andrea reported that for
    SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check,
    but we need to check again if the address was rounded down to nil. As
    of this patch, such cases will return -EINVAL.
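
The double check described above can be modelled in user space roughly as follows. This is only a sketch: the MODEL_* flag values and the function name are hypothetical stand-ins, and the alignment handling is simplified compared with the real shmat() path.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real shmat() flags; values are illustrative. */
#define MODEL_SHM_RND   0x1u
#define MODEL_SHM_REMAP 0x2u

/*
 * Model of the address validation described above.
 * shmlba must be a power of two. On success the (possibly rounded)
 * address is stored in *out and 0 is returned; otherwise -EINVAL.
 */
static int model_check_attach_addr(uintptr_t addr, unsigned int flags,
                                   uintptr_t shmlba, uintptr_t *out)
{
    /* First check: SHM_REMAP never accepts a nil address. */
    if (addr == 0 && (flags & MODEL_SHM_REMAP))
        return -EINVAL;

    if (addr & (shmlba - 1)) {
        if (!(flags & MODEL_SHM_RND))
            return -EINVAL;      /* unaligned and no rounding requested */
        addr &= ~(shmlba - 1);   /* round down to a multiple of shmlba */
        /* Second check: rounding may have produced nil. */
        if (addr == 0 && (flags & MODEL_SHM_REMAP))
            return -EINVAL;
    }
    *out = addr;
    return 0;
}
```

With shmlba = 4096, an address of 1 with both flags set passes the first check but is rejected by the second, which is the case the patch closes.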

    Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805
    Signed-off-by: Davidlohr Bueso
    Reported-by: Andrea Arcangeli
    Cc: Joe Lawrence
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "ipc/shm: shmat() fixes around nil-page".

    These patches fix two issues reported[1] a while back by Joe and Andrea
    around how shmat(2) behaves with nil-page.

    The first reverts a commit that incorrectly assumed that mapping
    nil-page (address=0) was a no-no with MAP_FIXED. This is not the case,
    with the exception of SHM_REMAP, which is addressed in the second patch.

    I chose two patches because it is easier to backport and it explicitly
    reverts bogus behaviour. Both patches ought to be in -stable and ltp
    testcases need to be updated (the added testcase around the cve can be
    modified to just test for SHM_RND|SHM_REMAP).

    [1] lkml.kernel.org/r/20180430172152.nfa564pvgpk3ut7p@linux-n805

    This patch (of 2):

    Commit 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
    worked on the idea that we should not be mapping as root addr=0 and
    MAP_FIXED. However, it was reported that this scenario is in fact
    valid, making the patch both bogus and a source of userspace breakage.

    For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem
    initialization[1].

    [1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347
    Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net
    Fixes: 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
    Signed-off-by: Davidlohr Bueso
    Reported-by: Joe Lawrence
    Reported-by: Andrea Arcangeli
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

20 Apr, 2018

5 commits

  • 32-bit architectures implementing 64BIT_TIME and COMPAT_32BIT_TIME
    need to have the traditional semtimedop() behavior with 32-bit timestamps
    for sys_ipc() by calling compat_ksys_semtimedop(), while those that
    are not yet converted need to keep using ksys_semtimedop() like
    64-bit architectures do.

    Note that I chose to not implement a new SEMTIMEDOP64 function that
    corresponds to the new sys_semtimedop() with 64-bit timeouts. The reason
    here is that sys_ipc() should no longer be used for new system calls,
    and libc should just call the semtimedop syscall directly.

    One open question remains as to whether we want to completely avoid the
    sys_ipc() system call for architectures that do not yet have all the
    individual calls as they get converted to 64-bit time_t. Doing that
    would require adding several extra system calls on m68k, mips, powerpc,
    s390, sh, sparc, and x86-32.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • Three ipc syscalls (mq_timedsend, mq_timedreceive and semtimedop)
    take a timespec argument. After we move 32-bit architectures over to
    using 64-bit time_t based syscalls, we need separate entry points for
    the old 32-bit based interfaces.

    This changes the #ifdef guards for the existing 32-bit compat syscalls
    to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
    enabled on all existing 32-bit architectures.
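
As a rough illustration of what a separate 32-bit entry point has to do, the sketch below widens a 32-bit timespec into a 64-bit one. The struct and function names are hypothetical and are not the kernel's own helpers.

```c
#include <stdint.h>

/* Hypothetical layout of a timespec as seen by old 32-bit user space. */
struct model_timespec32 {
    int32_t tv_sec;
    int32_t tv_nsec;
};

/* Hypothetical 64-bit representation used internally. */
struct model_timespec64 {
    int64_t tv_sec;
    int64_t tv_nsec;
};

/* Widen a 32-bit timespec; sign extension keeps negative values negative. */
static struct model_timespec64 model_widen_timespec(struct model_timespec32 in)
{
    struct model_timespec64 out = { .tv_sec = in.tv_sec, .tv_nsec = in.tv_nsec };
    return out;
}
```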

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • This is a preparation for changing over __kernel_timespec to 64-bit
    times, which involves assigning new system call numbers for mq_timedsend(),
    mq_timedreceive() and semtimedop() for compatibility with future y2038
    proof user space.

    The existing ABIs will remain available through compat code.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • The shmid64_ds/semid64_ds/msqid64_ds data structures have been extended
    to contain extra fields for storing the upper bits of the time stamps;
    this patch does the other half of the job and fills the new fields on
    32-bit architectures as well as 32-bit tasks running on a 64-bit kernel
    in compat mode.

    There should be no change for native 64-bit tasks.
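
Filling a separate "upper bits" field amounts to splitting a 64-bit timestamp into two 32-bit halves. A minimal sketch, assuming hypothetical field and function names (the real structures name their fields differently):

```c
#include <stdint.h>

/* Hypothetical pair of fields: the legacy 32-bit slot plus the new "high" slot. */
struct model_stamp {
    uint32_t time_lo;
    uint32_t time_high;
};

/* Split a 64-bit seconds value across the two slots. */
static struct model_stamp model_store_time(int64_t t)
{
    struct model_stamp s;
    s.time_lo = (uint32_t)((uint64_t)t & 0xffffffffu);
    s.time_high = (uint32_t)((uint64_t)t >> 32);
    return s;
}

/* Recombine the two slots into the original 64-bit value. */
static int64_t model_load_time(struct model_stamp s)
{
    return (int64_t)(((uint64_t)s.time_high << 32) | s.time_lo);
}
```

A reader that only knows the legacy slot still sees the low 32 bits unchanged, which is why native 64-bit tasks are unaffected.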

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • In some places, we still used get_seconds() instead of
    ktime_get_real_seconds(), and I'm changing the remaining ones now to
    all use ktime_get_real_seconds() so we use the full available range for
    timestamps instead of overflowing the 'unsigned long' return value in
    year 2106 on 32-bit kernels.
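
The 2106 figure follows from simple arithmetic on a 32-bit unsigned second counter starting at 1970. The sketch below uses hypothetical helper names and an average Gregorian year length, so it is only accurate to the year:

```c
#include <stdint.h>

/* Seconds value as it would appear after truncation to 32 unsigned bits. */
static uint32_t model_truncate_seconds(int64_t secs)
{
    return (uint32_t)secs;
}

/* Approximate calendar year in which a 32-bit unsigned second counter wraps. */
static int model_u32_wrap_year(void)
{
    const int64_t secs_per_year = 31556952; /* average Gregorian year */
    return 1970 + (int)((int64_t)UINT32_MAX / secs_per_year);
}
```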

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

14 Apr, 2018

1 commit

  • syzbot reported a use-after-free of shm_file_data(file)->file->f_op in
    shm_get_unmapped_area(), called via sys_remap_file_pages().

    Unfortunately it couldn't generate a reproducer, but I found a bug which
    I think caused it. When remap_file_pages() is passed a full System V
    shared memory segment, the memory is first unmapped, then a new map is
    created using the ->vm_file. Between these steps, the shm ID can be
    removed and reused for a new shm segment. But, shm_mmap() only checks
    whether the ID is currently valid before calling the underlying file's
    ->mmap(); it doesn't check whether it was reused. Thus it can use the
    wrong underlying file, one that was already freed.

    Fix this by making the "outer" shm file (the one that gets put in
    ->vm_file) hold a reference to the real shm file, and by making
    __shm_open() require that the file associated with the shm ID matches
    the one associated with the "outer" file.

    Taking the reference to the real shm file is needed to fully solve the
    problem, since otherwise sfd->file could point to a freed file, which
    then could be reallocated for the reused shm ID, causing the wrong shm
    segment to be mapped (and without the required permission checks).

    Commit 1ac0b6dec656 ("ipc/shm: handle removed segments gracefully in
    shm_mmap()") almost fixed this bug, but it didn't go far enough because
    it didn't consider the case where the shm ID is reused.
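
The ID-reuse check can be pictured with a small user-space model. Everything here is hypothetical (the table, the struct names, the choice of error codes); it only illustrates comparing the file remembered by the mapping with the file currently bound to the ID.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_IDS 4

/* Hypothetical stand-in for the real shm file backing a segment. */
struct model_file { int refcount; };

/* Hypothetical ID table: slot i holds the file currently bound to ID i. */
static struct model_file *model_id_table[MODEL_MAX_IDS];

/* Hypothetical per-mapping data: remembers both the ID and the file. */
struct model_map_data {
    int id;
    struct model_file *file;   /* reference held for the mapping's lifetime */
};

static int model_shm_open(const struct model_map_data *sfd)
{
    struct model_file *cur;

    if (sfd->id < 0 || sfd->id >= MODEL_MAX_IDS)
        return -EINVAL;
    cur = model_id_table[sfd->id];
    if (cur == NULL)
        return -EIDRM;         /* ID was removed (errno choice is an assumption) */
    if (cur != sfd->file)
        return -EINVAL;        /* ID was reused for a different segment */
    return 0;
}
```

Because the mapping holds its own reference to the file, the pointer comparison cannot be fooled by a freed file being reallocated at the same address.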

    The following program usually reproduces this bug:

    #include <stdlib.h>
    #include <sys/shm.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main()
    {
        int is_parent = (fork() != 0);
        srand(getpid());
        for (;;) {
            int id = shmget(0xF00F, 4096, IPC_CREAT|0700);
            if (is_parent) {
                void *addr = shmat(id, NULL, 0);
                usleep(rand() % 50);
                while (!syscall(__NR_remap_file_pages, addr, 4096, 0, 0, 0));
            } else {
                usleep(rand() % 50);
                shmctl(id, IPC_RMID, NULL);
            }
        }
    }

    It causes the following NULL pointer dereference due to a 'struct file'
    being used while it's being freed. (I couldn't actually get a KASAN
    use-after-free splat like in the syzbot report. But I think it's
    possible with this bug; it would just take a more extraordinary race...)

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 9 PID: 258 Comm: syz_ipc Not tainted 4.16.0-05140-gf8cf2f16a7c95 #189
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    RIP: 0010:d_inode include/linux/dcache.h:519 [inline]
    RIP: 0010:touch_atime+0x25/0xd0 fs/inode.c:1724
    [...]
    Call Trace:
    file_accessed include/linux/fs.h:2063 [inline]
    shmem_mmap+0x25/0x40 mm/shmem.c:2149
    call_mmap include/linux/fs.h:1789 [inline]
    shm_mmap+0x34/0x80 ipc/shm.c:465
    call_mmap include/linux/fs.h:1789 [inline]
    mmap_region+0x309/0x5b0 mm/mmap.c:1712
    do_mmap+0x294/0x4a0 mm/mmap.c:1483
    do_mmap_pgoff include/linux/mm.h:2235 [inline]
    SYSC_remap_file_pages mm/mmap.c:2853 [inline]
    SyS_remap_file_pages+0x232/0x310 mm/mmap.c:2769
    do_syscall_64+0x64/0x1a0 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    [ebiggers@google.com: add comment]
    Link: http://lkml.kernel.org/r/20180410192850.235835-1-ebiggers3@gmail.com
    Link: http://lkml.kernel.org/r/20180409043039.28915-1-ebiggers3@gmail.com
    Reported-by: syzbot+d11f321e7f1923157eac80aa990b446596f46439@syzkaller.appspotmail.com
    Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
    Signed-off-by: Eric Biggers
    Acked-by: Kirill A. Shutemov
    Acked-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: "Eric W . Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

12 Apr, 2018

5 commits

  • This was added by the recent "ipc/shm.c: add split function to
    shm_vm_ops", but it is not necessary.

    Reviewed-by: Mike Kravetz
    Cc: Laurent Dufour
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There is a permission discrepancy when consulting msq ipc object
    metadata between /proc/sysvipc/msg (0444) and the MSG_STAT msgctl
    command. The latter does permission checks for the object vs S_IRUGO.
    As such there can be cases where EACCES is returned via syscall but the
    info is displayed anyway in the procfs files.

    While this might have security implications via info leaking (albeit no
    writing to the msq metadata), this behavior goes way back and showing
    all the objects regardless of the permissions was most likely an
    oversight - so we are stuck with it. Furthermore, modifying either the
    syscall or the procfs file can cause userspace programs to break (e.g.
    ipcs). Some applications need this info without root privileges and so
    must parse the procfs file, which can be rather slow in comparison with
    a syscall -- up to 500x in some reported cases for shm.

    This patch introduces a new MSG_STAT_ANY command such that the msq ipc
    object permissions are ignored, and only audited instead. In addition,
    I've left the lsm security hook checks in place, as if some policy can
    block the call, then the user has no other choice than just parsing the
    procfs file.

    Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reported-by: Robert Kettler
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Manfred Spraul
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • There is a permission discrepancy when consulting sem ipc object
    metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
    command. The latter does permission checks for the object vs S_IRUGO.
    As such there can be cases where EACCES is returned via syscall but the
    info is displayed anyway in the procfs files.

    While this might have security implications via info leaking (albeit no
    writing to the sma metadata), this behavior goes way back and showing
    all the objects regardless of the permissions was most likely an
    oversight - so we are stuck with it. Furthermore, modifying either the
    syscall or the procfs file can cause userspace programs to break (e.g.
    ipcs). Some applications need this info without root privileges and so
    must parse the procfs file, which can be rather slow in comparison with
    a syscall -- up to 500x in some reported cases for shm.

    This patch introduces a new SEM_STAT_ANY command such that the sem ipc
    object permissions are ignored, and only audited instead. In addition,
    I've left the lsm security hook checks in place, as if some policy can
    block the call, then the user has no other choice than just parsing the
    procfs file.

    Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reported-by: Robert Kettler
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Manfred Spraul
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "sysvipc: introduce STAT_ANY commands", v2.

    The following patches add the discussed (see [1]) new command for shm
    as well as for sems and msq, as they are subject to the same
    discrepancies for ipc object permission checks between the syscall and
    procfs. These new commands are justified in that (1) we are stuck
    with these semantics as changing syscall and procfs can break userland;
    and (2) some users can benefit from performance (for large amounts of
    shm segments, for example) from not having to parse the procfs
    interface.

    Once merged, I will submit the necessary manpage updates. But I'm thinking
    something like:

    : diff --git a/man2/shmctl.2 b/man2/shmctl.2
    : index 7bb503999941..bb00bbe21a57 100644
    : --- a/man2/shmctl.2
    : +++ b/man2/shmctl.2
    : @@ -41,6 +41,7 @@
    : .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
    : .\" attaches to a segment that has already been marked for deletion.
    : .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
    : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
    : .\"
    : .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
    : .SH NAME
    : @@ -242,6 +243,18 @@ However, the
    : argument is not a segment identifier, but instead an index into
    : the kernel's internal array that maintains information about
    : all shared memory segments on the system.
    : +.TP
    : +.BR SHM_STAT_ANY " (Linux-specific)"
    : +Return a
    : +.I shmid_ds
    : +structure as for
    : +.BR SHM_STAT .
    : +However, the
    : +.I shm_perm.mode
    : +is not checked for read access for
    : +.IR shmid ,
    : +resembing the behaviour of
    : +/proc/sysvipc/shm.
    : .PP
    : The caller can prevent or allow swapping of a shared
    : memory segment with the following \fIcmd\fP values:
    : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
    : kernel's internal array recording information about all
    : shared memory segments.
    : (This information can be used with repeated
    : -.B SHM_STAT
    : +.B SHM_STAT/SHM_STAT_ANY
    : operations to obtain information about all shared memory segments
    : on the system.)
    : A successful
    : @@ -328,7 +341,7 @@ isn't accessible.
    : \fIshmid\fP is not a valid identifier, or \fIcmd\fP
    : is not a valid command.
    : Or: for a
    : -.B SHM_STAT
    : +.B SHM_STAT/SHM_STAT_ANY
    : operation, the index value specified in
    : .I shmid
    : referred to an array slot that is currently unused.

    This patch (of 3):

    There is a permission discrepancy when consulting shm ipc object metadata
    between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command. The
    latter does permission checks for the object vs S_IRUGO. As such there can
    be cases where EACCES is returned via syscall but the info is displayed
    anyway in the procfs files.

    While this might have security implications via info leaking (albeit no
    writing to the shm metadata), this behavior goes way back and showing all
    the objects regardless of the permissions was most likely an oversight - so
    we are stuck with it. Furthermore, modifying either the syscall or the
    procfs file can cause userspace programs to break (e.g. ipcs). Some
    applications need this info without root privileges and so must parse the
    procfs file, which can be rather slow in comparison with a syscall -- up
    to 500x in some reported cases.

    This patch introduces a new SHM_STAT_ANY command such that the shm ipc
    object permissions are ignored, and only audited instead. In addition,
    I've left the lsm security hook checks in place, as if some policy can
    block the call, then the user has no other choice than just parsing the
    procfs file.
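
From user space, the new command could be exercised along these lines. This is a sketch only: the function name is hypothetical, the fallback value 15 for SHM_STAT_ANY is taken from the Linux UAPI header for libcs that do not define it, and kernels without the command simply fail each call.

```c
#define _GNU_SOURCE
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_STAT_ANY
#define SHM_STAT_ANY 15   /* value from the Linux UAPI header; older libcs lack it */
#endif

/*
 * Count the segments visible through SHM_STAT_ANY.
 * Returns -1 if SHM_INFO fails (e.g. SysV IPC is not available).
 */
static int count_segments_stat_any(void)
{
    struct shm_info info;
    struct shmid_ds ds;
    int count = 0;

    /* SHM_INFO returns the highest used index in the kernel's array. */
    int maxidx = shmctl(0, SHM_INFO, (struct shmid_ds *)&info);
    if (maxidx < 0)
        return -1;

    for (int i = 0; i <= maxidx; i++) {
        /* With SHM_STAT_ANY the first argument is an index, not an ID. */
        if (shmctl(i, SHM_STAT_ANY, &ds) >= 0)
            count++;
    }
    return count;
}
```

Compared with parsing /proc/sysvipc/shm, this walks the same set of segments with one syscall per slot and no text parsing.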

    [1] https://lkml.org/lkml/2017/12/19/220

    Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Cc: Michael Kerrisk
    Cc: Manfred Spraul
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Robert Kettler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Move the proc_mkdir() call within the sysvipc subsystem such that we
    avoid polluting proc_root_init() with petty cpp.

    [dave@stgolabs.net: contributed changelog]
    Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Acked-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

04 Apr, 2018

1 commit

  • Pull namespace updates from Eric Biederman:
    "There was a lot of work this cycle fixing bugs that were discovered
    after the merge window and getting everything ready where we can
    reasonably support fully unprivileged fuse. The bug fixes you already
    have and much of the unprivileged fuse work is coming in via other
    trees.

    Still left for fully unprivileged fuse is figuring out how to cleanly
    handle .set_acl and .get_acl in the legacy case, and proper handling
    of evm xattrs on unprivileged mounts.

    Included in the tree is a cleanup from Alexey that replaced a linked
    list with a statically allocated fixed-size array for the pid caches,
    which simplifies and speeds things up.

    Then there are some cleanups and fixes for the ipc namespace. The
    motivation was that in reviewing other code it was discovered that
    accessing ipc objects from different pid namespaces recorded pids in such
    a way that when asked the wrong pids were returned. In the worst case
    there has been a measured 30% performance impact for sysvipc
    semaphores. Other test cases showed no measurable performance impact.
    Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc
    performance both gave the nod that this is good enough.

    Casey Schaufler and James Morris have given their approval to the LSM
    side of the changes.

    I simplified the types and the code dealing with sysvipc to pass just
    kern_ipc_perm for all three types of ipc. Which reduced the header
    dependencies throughout the kernel and simplified the lsm code.

    Which let me work on the pid fixes without having to worry about
    trivial changes causing complete kernel recompiles"

    * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    ipc/shm: Fix pid freeing.
    ipc/shm: fix up for struct file no longer being available in shm.h
    ipc/smack: Tidy up from the change in type of the ipc security hooks
    ipc: Directly call the security hook in ipc_ops.associate
    ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces
    ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces
    ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.
    ipc/util: Helpers for making the sysvipc operations pid namespace aware
    ipc: Move IPCMNI from include/ipc.h into ipc/util.h
    msg: Move struct msg_queue into ipc/msg.c
    shm: Move struct shmid_kernel into ipc/shm.c
    sem: Move struct sem and struct sem_array into ipc/sem.c
    msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks
    shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks
    sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks
    pidns: simpler allocation of pid_* caches

    Linus Torvalds
     

03 Apr, 2018

1 commit

  • Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
    "System calls are interaction points between userspace and the kernel.
    Therefore, system call functions such as sys_xyzzy() or
    compat_sys_xyzzy() should only be called from userspace via the
    syscall table, but not from elsewhere in the kernel.

    At least on 64-bit x86, it will likely be a hard requirement from
    v4.17 onwards to not call system call functions in the kernel: It is
    better to use a different calling convention for system calls
    there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
    which then hands processing over to the actual syscall function. This
    means that only those parameters which are actually needed for a
    specific syscall are passed on during syscall entry, instead of
    filling in six CPU registers with random user space content all the
    time (which may cause serious trouble down the call chain). Those
    x86-specific patches will be pushed through the x86 tree in the near
    future.

    Moreover, rules on how data may be accessed may differ between kernel
    data and user data. This is another reason why calling sys_xyzzy() is
    generally a bad idea, and -- at most -- acceptable in arch-specific
    code.

    This patchset removes all in-kernel calls to syscall functions in the
    kernel with the exception of arch/. On top of this, it cleans up the
    three places where many syscalls are referenced or prototyped, namely
    kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
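
The syscall-wrapper idea from the quoted message can be sketched as below. All names are hypothetical and this is a user-space model, not the kernel's macros: the stub receives a register snapshot and forwards only the arguments the call actually takes.

```c
/* Hypothetical register snapshot; field names follow the x86-64 syscall ABI. */
struct model_regs {
    unsigned long di, si, dx, r10, r8, r9;
};

/* The real work: takes exactly the two arguments it needs. */
static long model_do_xyzzy(int fd, unsigned long len)
{
    return (long)fd + (long)len;   /* placeholder body */
}

/* Entry stub: decodes only the registers this call uses. */
static long model_sys_xyzzy(const struct model_regs *regs)
{
    return model_do_xyzzy((int)regs->di, regs->si);
}
```

Whatever user space left in dx, r10, r8 and r9 never reaches model_do_xyzzy(), which is the property the new calling convention is after.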

    * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
    bpf: whitelist all syscalls for error injection
    kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
    kernel/sys_ni: sort cond_syscall() entries
    syscalls/x86: auto-create compat_sys_*() prototypes
    syscalls: sort syscall prototypes in include/linux/compat.h
    net: remove compat_sys_*() prototypes from net/compat.h
    syscalls: sort syscall prototypes in include/linux/syscalls.h
    kexec: move sys_kexec_load() prototype to syscalls.h
    x86/sigreturn: use SYSCALL_DEFINE0
    x86: fix sys_sigreturn() return type to be long, not unsigned long
    x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
    mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
    mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
    mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
    fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
    fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
    fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
    fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
    kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
    kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
    ...

    Linus Torvalds