09 Jun, 2022

1 commit

  • [ Upstream commit d60c4d01a98bc1942dba6e3adc02031f5519f94b ]

    When running the stress-ng clone benchmark with multiple testing threads,
    it was found that there was significant spinlock contention in sget_fc().
    The contended spinlock was the sb_lock. It is under heavy contention
    because of the following code in the critical section of sget_fc():

    hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
            if (test(old, fc))
                    goto share_extant_sb;
    }

    After testing with added instrumentation code, it was found that the
    benchmark could generate thousands of ipc namespaces with the
    corresponding number of entries in the mqueue's fs_supers list where the
    namespaces are the key for the search. This leads to excessive time in
    scanning the list for a match.

    Looking back at the mqueue calling sequence leading to sget_fc():

    mq_init_ns()
    => mq_create_mount()
       => fc_mount()
          => vfs_get_tree()
             => mqueue_get_tree()
                => get_tree_keyed()
                   => vfs_get_super()
                      => sget_fc()

    Currently, mq_init_ns() is the only mqueue function that will indirectly
    call mqueue_get_tree() with a newly allocated ipc namespace as the key for
    searching. As a result, there will never be a match with the existing ipc
    namespaces stored in the mqueue's fs_supers list.

    So using get_tree_keyed() to do an existing ipc namespace search is just a
    waste of time. Instead, we could use get_tree_nodev() to eliminate the
    useless search. By doing so, we can greatly reduce the sb_lock hold time
    and avoid the spinlock contention problem in case a large number of ipc
    namespaces are present.

    Of course, if the code is modified in the future to allow
    mqueue_get_tree() to be called with an existing ipc namespace instead of a
    new one, we will have to use get_tree_keyed() in this case.
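
    The cost difference described above can be illustrated with a small
    user-space model. This is only a sketch: sb_model, fs_type_model,
    get_keyed() and get_nodev() are hypothetical names, not kernel APIs, and
    no locking is modeled. It counts the key comparisons a keyed lookup makes
    when every key is new, versus an unconditional insert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a superblock on the fs_supers list. */
struct sb_model {
    const void *key;            /* e.g. the ipc namespace pointer */
    struct sb_model *next;
};

struct fs_type_model {
    struct sb_model *supers;    /* head of the list of instances */
    size_t compares;            /* key comparisons performed so far */
};

/* Insert a new entry at the head of the list. */
static struct sb_model *insert_new(struct fs_type_model *fs, const void *key)
{
    struct sb_model *sb = malloc(sizeof(*sb));

    if (!sb)
        return NULL;
    sb->key = key;
    sb->next = fs->supers;
    fs->supers = sb;
    return sb;
}

/* Keyed variant: scan the whole list for a match before inserting. */
static struct sb_model *get_keyed(struct fs_type_model *fs, const void *key)
{
    for (struct sb_model *sb = fs->supers; sb; sb = sb->next) {
        fs->compares++;
        if (sb->key == key)
            return sb;
    }
    return insert_new(fs, key);
}

/* No-search variant: the caller knows the key is new, so just insert. */
static struct sb_model *get_nodev(struct fs_type_model *fs, const void *key)
{
    return insert_new(fs, key);
}
```

    With N distinct keys the keyed variant performs N*(N-1)/2 comparisons in
    total, all of which would happen under sb_lock in the kernel, while the
    no-search variant performs none.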

    The following stress-ng clone benchmark command was run on a 2-socket
    48-core Intel system:

    ./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20

    The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after
    patch. This is an increase of 54% in performance.

    Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com
    Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context")
    Signed-off-by: Waiman Long
    Cc: Al Viro
    Cc: David Howells
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Waiman Long
     

09 Feb, 2022

1 commit

  • commit 520ba724061cef59763e2b6f5b26e8387c2e5822 upstream.

    We can't call kvfree() with a spin lock held, so defer it.
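
    The general pattern (detach under the lock, free after dropping it) can
    be sketched in user space as follows. This is not the patch itself:
    toy_lock, model_kvfree(), undo_model and replace_buf() are hypothetical
    names, and the "lock" only records whether it is held so that the rule
    can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy lock: it only tracks whether it is held. */
struct toy_lock {
    bool held;
};

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Counts calls to the free routine made while the lock was held. */
static int frees_under_lock;

/* Stand-in for a free routine that must not run under a spin lock. */
static void model_kvfree(const struct toy_lock *l, void *p)
{
    if (l->held)
        frees_under_lock++;
    free(p);
}

struct undo_model {
    struct toy_lock lock;
    int *buf;
};

/* Swap in a new buffer; the old one is only detached under the lock. */
static int replace_buf(struct undo_model *u, size_t n)
{
    int *new_buf = calloc(n, sizeof(*new_buf)); /* allocate before locking */
    int *old;

    if (!new_buf)
        return -1;

    toy_lock_acquire(&u->lock);
    old = u->buf;               /* remember what has to be freed */
    u->buf = new_buf;
    toy_lock_release(&u->lock);

    model_kvfree(&u->lock, old); /* deferred until the lock is dropped */
    return 0;
}
```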

    Link: https://lkml.kernel.org/r/20211223031207.556189-1-chi.minghao@zte.com.cn
    Fixes: fc37a3b8b438 ("[PATCH] ipc sem: use kvmalloc for sem_undo allocation")
    Reported-by: Zeal Robot
    Signed-off-by: Minghao Chi
    Reviewed-by: Shakeel Butt
    Reviewed-by: Manfred Spraul
    Cc: Arnd Bergmann
    Cc: Yang Guang
    Cc: Davidlohr Bueso
    Cc: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Cc: Vasily Averin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minghao Chi
     

25 Nov, 2021

2 commits

  • commit 85b6d24646e4125c591639841169baa98a2da503 upstream.

    Currently, the exit_shm() function is not designed to work properly when
    task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

    This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
    leads to use-after-free (reproducer exists).

    This is an attempt to fix the problem by extending the exit_shm mechanism
    to handle the destruction of shm objects from several IPC namespaces.

    To achieve that we do several things:

    1. add a namespace (non-refcounted) pointer to the struct shmid_kernel

    2. during new shm object creation (newseg()/shmget syscall) we
    initialize this pointer with the current task's IPC ns

    3. exit_shm() is fully reworked such that it traverses all shp's in
    task->sysvshm.shm_clist and gets the IPC namespace not from the current
    task, as it did before, but from the shp object itself, and then calls
    shm_destroy(shp, ns).

    Note: We need to be really careful here because, as was said in (1),
    our pointer to the IPC ns is not refcounted. To be on the safe side we
    use the special helper get_ipc_ns_not_zero(), which takes a reference
    on the IPC ns only if the IPC ns is not in the "state of destruction".
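
    The "take a reference only if the count is not already zero" idea can be
    sketched with C11 atomics. This is an illustration of the technique, not
    the kernel helper: ns_model, get_not_zero() and put_ns() are hypothetical
    names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted namespace object. */
struct ns_model {
    atomic_int count;   /* 0 means the object is being destroyed */
};

/* Take a reference only if the object is still alive. */
static bool get_not_zero(struct ns_model *ns)
{
    int old = atomic_load(&ns->count);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(&ns->count, &old, old + 1))
            return true;
    }
    return false;       /* count already dropped to zero: do not revive */
}

/* Drop a reference; returns true when this was the last one. */
static bool put_ns(struct ns_model *ns)
{
    return atomic_fetch_sub(&ns->count, 1) == 1;
}
```

    A caller that gets false back must treat the namespace as gone and skip
    the object, which is the situation the note above guards against.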

    Q/A

    Q: Why can we access shp->ns memory using a non-refcounted pointer?
    A: Because the shp object's lifetime is always shorter than the IPC
    namespace's lifetime, so if we get the shp object from
    task->sysvshm.shm_clist while holding task_lock(task), nobody can
    steal our namespace.

    Q: Does this patch change the semantics of the unshare/setns/clone
    syscalls?
    A: No. It just fixes a non-covered case where a process may leave an
    IPC namespace without getting its task->sysvshm.shm_clist list cleaned
    up.

    Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
    Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
    Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
    Co-developed-by: Manfred Spraul
    Signed-off-by: Manfred Spraul
    Signed-off-by: Alexander Mikhalitsyn
    Cc: "Eric W. Biederman"
    Cc: Davidlohr Bueso
    Cc: Greg KH
    Cc: Andrei Vagin
    Cc: Pavel Tikhomirov
    Cc: Vasily Averin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Alexander Mikhalitsyn
     
  • commit 126e8bee943e9926238c891e2df5b5573aee76bc upstream.

    Patch series "shm: shm_rmid_forced feature fixes".

    Some time ago I hit a kernel crash after a CRIU restore procedure.
    Fortunately, it was a CRIU restore, so I had dump files, could repeat
    the restore many times, and the crash reproduced easily. After some
    investigation I constructed a minimal reproducer. It turned out to be
    a use-after-free that happens only if sysctl
    kernel.shm_rmid_forced = 1.

    The key to the problem is that the exit_shm() function does not handle
    the destruction of shp objects when task->sysvshm.shm_clist contains
    items from different IPC namespaces. In most cases this list will
    contain only items from one IPC namespace.

    How can this list contain objects from different namespaces? The
    exit_shm() function is designed to clean up this list whenever a
    process leaves an IPC namespace. But we made a mistake a long time ago
    and did not add an exit_shm() call to the setns() syscall procedures.

    The first idea was just to add this call to the setns() syscall, but
    that obviously changes the semantics of setns() and is a
    userspace-visible change. So, I gave up on this idea.

    The first real attempt to address the issue was just to omit the forced
    destroy if we meet an shp object that is not from the current task's
    IPC namespace [1]. But that was not the best idea because
    task->sysvshm.shm_clist was protected by an rwsem which belongs to the
    current task's IPC namespace. It means that list corruption may occur.

    The second approach is to extend exit_shm() to properly handle shp's
    from different IPC namespaces [2]. This is a really non-trivial thing.
    I put a lot of effort into it but did not believe that it was possible
    to make it fully safe, clean and clear.

    Thanks to the efforts of Manfred Spraul, an elegant solution was
    designed. Thanks a lot, Manfred!

    Eric also suggested a way to address the issue in ("[RFC][PATCH] shm:
    In shm_exit destroy all created and never attached segments"). Eric's
    idea was to maintain a list of shm_clists, one per IPC namespace, and
    to use lock-less lists. But there are some concerns about extra memory
    consumption.

    An alternative solution which was suggested by me was implemented in
    ("shm: reset shm_clist on setns but omit forced shm destroy"). The idea
    is pretty simple: we add an exit_shm() call to setns() but DO NOT
    destroy shm segments even if sysctl kernel.shm_rmid_forced = 1; we just
    clean up the task->sysvshm.shm_clist list.

    This changes the semantics of the setns() syscall a little bit, but in
    comparison to the "naive" solution, where we just add exit_shm() without
    any special exclusions, this looks like a safer option.

    [1] https://lkml.org/lkml/2021/7/6/1108
    [2] https://lkml.org/lkml/2021/7/14/736

    This patch (of 2):

    Let's produce a warning if we try to remove a non-existing IPC object
    from the IPC namespace kht/idr structures.

    This allows us to catch possible bugs when the ipc_rmid() function was
    called with inconsistent struct ipc_ids*, struct kern_ipc_perm*
    arguments.

    Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
    Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
    Co-developed-by: Manfred Spraul
    Signed-off-by: Manfred Spraul
    Signed-off-by: Alexander Mikhalitsyn
    Cc: "Eric W. Biederman"
    Cc: Davidlohr Bueso
    Cc: Greg KH
    Cc: Andrei Vagin
    Cc: Pavel Tikhomirov
    Cc: Vasily Averin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Alexander Mikhalitsyn
     

15 Sep, 2021

1 commit

  • Linus proposes to revert the accounting for sops objects in
    do_semtimedop() because it's really just a temporary buffer
    for a single semtimedop() system call.

    This object can consume up to 2 pages, the syscall is a sleeping
    one, its size and duration can be controlled by the user, and this
    allocation can be repeated by many threads at the same time.

    However, Shakeel Butt pointed out that there are much more popular
    objects with the same lifetime and similar memory
    consumption, the accounting of which was rejected for
    performance reasons.

    Considering at least 2 pages for task_struct and 2 pages for
    the kernel stack, a back of the envelope calculation gives a
    footprint amplification of > 2 (from the
    PoV of excessive (ab)use, fine-grained accounting seems to be
    currently unfeasible due to performance impact).

    Link: https://lore.kernel.org/lkml/90e254df-0dfe-f080-011e-b7c53ee7fd20@virtuozzo.com/
    Fixes: 18319498fdd4 ("memcg: enable accounting of ipc resources")
    Signed-off-by: Vasily Averin
    Acked-by: Michal Hocko
    Reviewed-by: Michal Koutný
    Acked-by: Shakeel Butt
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

10 Sep, 2021

1 commit

  • Pull ARM development updates from Russell King:

    - Rename "mod_init" and "mod_exit" so that initcall debug output is
    actually useful (Randy Dunlap)

    - Update maintainers entries for linux-arm-kernel to indicate it is
    moderated for non-subscribers (Randy Dunlap)

    - Move install rules to arch/arm/Makefile (Masahiro Yamada)

    - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij)

    - Don't warn about atags_to_fdt() stack size (David Heidelberg)

    - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann)

    - Get rid of set_fs() usage (Arnd Bergmann)

    - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven)

    * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
    ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate
    ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK()
    ARM: 9116/1: unified: Remove check for gcc < 4
    ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning
    ARM: 9113/1: uaccess: remove set_fs() implementation
    ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault
    ARM: 9111/1: oabi-compat: rework fcntl64() emulation
    ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation
    ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation
    ARM: 9107/1: syscall: always store thread_info->abi_syscall
    ARM: 9109/1: oabi-compat: add epoll_pwait handler
    ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs()
    ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault
    ARM: 9105/1: atags_to_fdt: don't warn about stack size
    ARM: 9103/1: Drop ARCH_NR_GPIOS definition
    ARM: 9102/1: move theinstall rules to arch/arm/Makefile
    ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated
    ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific

    Linus Torvalds
     

09 Sep, 2021

2 commits

  • Merge more updates from Andrew Morton:
    "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

    Subsystems affected by this patch series: mm (memory-hotplug, rmap,
    ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
    alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
    checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
    selftests, ipc, and scripts"

    * emailed patches from Andrew Morton: (94 commits)
    scripts: check_extable: fix typo in user error message
    mm/workingset: correct kernel-doc notations
    ipc: replace costly bailout check in sysvipc_find_ipc()
    selftests/memfd: remove unused variable
    Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
    configs: remove the obsolete CONFIG_INPUT_POLLDEV
    prctl: allow to setup brk for et_dyn executables
    pid: cleanup the stale comment mentioning pidmap_init().
    kernel/fork.c: unexport get_{mm,task}_exe_file
    coredump: fix memleak in dump_vma_snapshot()
    fs/coredump.c: log if a core dump is aborted due to changed file permissions
    nilfs2: use refcount_dec_and_lock() to fix potential UAF
    nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
    nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
    nilfs2: fix NULL pointer in nilfs_##name##_attr_release
    nilfs2: fix memory leak in nilfs_sysfs_create_device_group
    trap: cleanup trap_init()
    init: move usermodehelper_enable() to populate_rootfs()
    ...

    Linus Torvalds
     
  • sysvipc_find_ipc() was left with a costly way to check if the offset
    position fed to it is bigger than the total number of IPC IDs in use. So
    much so that the time it takes to iterate over /proc/sysvipc/* files grows
    exponentially for a custom benchmark that creates "N" SYSV shm segments
    and then times the read of /proc/sysvipc/shm (milliseconds):

    12 msecs to read 1024 segs from /proc/sysvipc/shm
    18 msecs to read 2048 segs from /proc/sysvipc/shm
    65 msecs to read 4096 segs from /proc/sysvipc/shm
    325 msecs to read 8192 segs from /proc/sysvipc/shm
    1303 msecs to read 16384 segs from /proc/sysvipc/shm
    5182 msecs to read 32768 segs from /proc/sysvipc/shm

    The root problem lies with the loop that computes the total number of ids
    in use to check if the "pos" fed to sysvipc_find_ipc() grew bigger than
    "ids->in_use". That is quite an inefficient way to get to the maximum
    index in the id lookup table, especially when that value is already
    provided by struct ipc_ids.max_idx.

    This patch follows up on the optimization introduced via commit
    15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the
    aforementioned costly loop replacing it by a simpler checkpoint based on
    ipc_get_maxidx() returned value, which allows for a smooth linear increase
    in time complexity for the same custom benchmark:

    2 msecs to read 1024 segs from /proc/sysvipc/shm
    2 msecs to read 2048 segs from /proc/sysvipc/shm
    4 msecs to read 4096 segs from /proc/sysvipc/shm
    9 msecs to read 8192 segs from /proc/sysvipc/shm
    19 msecs to read 16384 segs from /proc/sysvipc/shm
    39 msecs to read 32768 segs from /proc/sysvipc/shm
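
    The two bailout strategies can be compared with a small user-space
    model. It is a sketch under stated assumptions: ids_model and both
    functions are hypothetical names, a plain bool array stands in for the
    id lookup table, and max_idx is assumed to be kept up to date elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the id table and its cached statistics. */
struct ids_model {
    const bool *used;   /* used[i] is true when slot i holds an object */
    int nslots;         /* size of the used[] array */
    int in_use;         /* number of used slots */
    int max_idx;        /* highest used slot, -1 when the table is empty */
    size_t probes;      /* slot lookups spent on bailout checks */
};

/* Counting check: walk the slots below pos and count the used ones. */
static bool past_end_counting(struct ids_model *ids, int pos)
{
    int total = 0;

    for (int id = 0; id < pos && id < ids->nslots && total < ids->in_use; id++) {
        ids->probes++;
        if (ids->used[id])
            total++;
    }
    return total >= ids->in_use;
}

/* Constant-time check based on the cached maximum index. */
static bool past_end_maxidx(const struct ids_model *ids, int pos)
{
    return ids->max_idx == -1 || pos > ids->max_idx;
}
```

    Reading all N entries calls the check once per position, so the counting
    variant costs on the order of N*N/2 slot lookups in total, while the
    max_idx variant adds a constant cost per position.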

    Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
    Signed-off-by: Rafael Aquini
    Acked-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

04 Sep, 2021

2 commits

  • When a user creates IPC objects, it forces the kernel to allocate memory
    for these long-lived objects.

    It makes sense to account them to restrict the host's memory consumption
    from inside the memcg-limited container.

    This patch enables accounting for IPC shared memory segments, messages,
    semaphores and semaphores' undo lists.

    Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Shakeel Butt
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Christian Brauner
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Johannes Weiner
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Roman Gushchin
    Cc: Serge Hallyn
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Yutian Yang
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • A container admin can create new namespaces and force the kernel to
    allocate up to several pages of memory for the namespaces and their
    associated structures.

    Net and uts namespaces already have accounting enabled for such
    allocations. It makes sense to account for the remaining ones as well to
    restrict the host's memory consumption from inside a memcg-limited
    container.

    Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
    Signed-off-by: Vasily Averin
    Acked-by: Serge Hallyn
    Acked-by: Christian Brauner
    Acked-by: Kirill Tkhai
    Reviewed-by: Shakeel Butt
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Yutian Yang
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

20 Aug, 2021

1 commit

  • sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To
    remove this one, expose the internal code of the actual implementation
    that operates on a kernel pointer and call it directly after copying.

    There should be no measurable impact on the normal execution of this
    function, and it makes the overly long function a little shorter, which
    may help readability.

    While reworking the oabi version, make it behave a little more like
    the native one, using kvmalloc_array() and restructure the code
    flow in a similar way.

    The naming of __do_semtimedop() is not very good, I hope someone can
    come up with a better name.

    One regression was spotted by kernel test robot
    and fixed before the first mailing list submission.

    Acked-by: Christoph Hellwig
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Russell King (Oracle)

    Arnd Bergmann
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton: (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

4 commits

  • If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
    MSG_INFO or SHM_INFO, then the return value is the index of the highest
    used index in the kernel's internal array recording information about all
    SysV objects of the requested type for the current namespace. (This
    information can be used with repeated ..._STAT or ..._STAT_ANY operations
    to obtain information about all SysV objects on the system.)

    There is a cache for this value. But when the cache needs to be updated,
    then the highest used index is determined by looping over all possible
    values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
    loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the
    index values do not need to be consecutive.

    With , msgget(), msgctl(,IPC_RMID) in a
    loop, I have observed a performance increase of around factor 13000.

    As there is no get_last() function for idr structures: Implement a
    "get_last()" using a binary search.

    As far as I see, ipc is the only user that needs get_last(), thus
    implement it in ipc/util.c and not in a central location.
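
    One way to build a "get_last()" on top of a "next used id at or above X"
    query is a binary search over the index range. The sketch below is an
    illustration of that idea, not the code in ipc/util.c: idset_model,
    next_used() and find_last() are hypothetical names, and a sorted array
    stands in for the idr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the idr: a sorted array of used ids. */
struct idset_model {
    const int *ids;
    size_t n;
};

/* Smallest used id >= start, or -1 when there is none. */
static int next_used(const struct idset_model *s, int start)
{
    size_t lo = 0, hi = s->n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (s->ids[mid] < start)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < s->n ? s->ids[lo] : -1;
}

/* Highest used id in [0, limit], or -1 when no id in that range is used. */
static int find_last(const struct idset_model *s, int limit)
{
    int lo = 0, hi = limit, last = -1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int found = next_used(s, mid);

        if (found >= 0 && found <= limit) {
            last = found;       /* an id lives at or above mid */
            lo = found + 1;     /* keep looking above it */
        } else {
            hi = mid - 1;       /* nothing in [mid, limit] */
        }
    }
    return last;
}
```

    The search needs a logarithmic number of queries in the size of the
    index range instead of a loop over every possible index, and it does not
    require the used ids to be consecutive.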

    [akpm@linux-foundation.org: tweak comment, fix typo]

    Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Acked-by: Davidlohr Bueso
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The patch solves three weaknesses in ipc/sem.c:

    1) The initial read of use_global_lock in sem_lock() is an intentional
    race. KCSAN detects these accesses and prints a warning.

    2) The code assumes that plain C read/writes are not mangled by the CPU
    or the compiler.

    3) The comment in sysvipc_sem_proc_show() was hard to understand: The
    rest of the comments in ipc/sem.c speak about sem_perm.lock, and
    suddenly this function speaks about ipc_lock_object().

    To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used
    in code that owns sma->sem_perm.lock.

    The comment is updated to solve 3)
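
    The split between marked lockless accesses and plain accesses under the
    lock can be sketched in user space. The macros below are simplified
    stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely on the
    GCC/Clang __typeof__ extension), and sem_array_model and the three
    functions are hypothetical names; no real locking is shown.

```c
#include <assert.h>

/* Simplified stand-ins: force exactly one access through a volatile lvalue. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical model of the semaphore array's mode counter. */
struct sem_array_model {
    int use_global_lock;    /* non-zero: per-semaphore fast path is off */
};

/* Lockless peek before any lock is taken: must be a marked read. */
static int fast_path_allowed(struct sem_array_model *sma)
{
    return READ_ONCE(sma->use_global_lock) == 0;
}

/* Assumed to run with the array lock held; lockless readers may race. */
static void enter_complex_mode(struct sem_array_model *sma, int hysteresis)
{
    WRITE_ONCE(sma->use_global_lock, hysteresis);
}

/* Assumed to run with the array lock held: the read can stay plain. */
static void leave_step(struct sem_array_model *sma)
{
    if (sma->use_global_lock > 0)
        WRITE_ONCE(sma->use_global_lock, sma->use_global_lock - 1);
}
```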

    [manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock]
    Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com

    Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com
    Signed-off-by: Manfred Spraul
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Davidlohr Bueso
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • msg_queue and shmid_kernel are quite small objects, no need to use
    kvmalloc for them. mhocko@: "Both of them are 256B on most 64b systems."

    Previously these objects were allocated via ipc_alloc/ipc_rcu_alloc(), a
    common function for several ipc objects. It had a kvmalloc call inside.
    Later, this function went away and was finally replaced by a direct
    kvmalloc call, and now we can use the more suitable kmalloc/kfree for them.

    Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com
    Reported-by: Alexey Dobriyan
    Signed-off-by: Vasily Averin
    Acked-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Dmitry Safonov
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • Patch series "ipc: allocations cleanup", v2.

    Some ipc objects use the wrong allocation functions: small objects can use
    kvmalloc(), and vice versa, potentially large objects can use kmalloc().

    This patch (of 2):

    Size of sem_undo can exceed one page and with the maximum possible nsems =
    32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to
    avoid user-triggered disruptive actions like OOM killer in case of
    high-order memory shortage.

    User triggerable high order allocations are quite a problem on heavily
    fragmented systems. They can be a DoS vector.
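
    The size arithmetic can be checked with a short calculation. This is a
    sketch under assumptions: undo_hdr_model is a made-up header, not the
    real struct sem_undo layout, one short per semaphore is assumed for the
    adjustment array, and 4096-byte pages are assumed in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size header; the real structure differs. */
struct undo_hdr_model {
    void *list_proc[2];
    void *list_id[2];
    void *owner;
    void *rcu[2];
    int semid;
};

/* Total allocation size: header plus one short per semaphore. */
static size_t undo_size(size_t nsems)
{
    return sizeof(struct undo_hdr_model) + nsems * sizeof(short);
}

/* Smallest order such that (page_size << order) covers the size. */
static unsigned int alloc_order(size_t size, size_t page_size)
{
    unsigned int order = 0;

    while ((page_size << order) < size)
        order++;
    return order;
}
```

    With nsems = 32000 the adjustment array alone is 64000 bytes, so a
    physically contiguous allocation would need 16 contiguous 4096-byte
    pages (order 4), which is the kind of high-order request that can fail
    on a fragmented system.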

    Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com
    Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com
    Signed-off-by: Vasily Averin
    Acked-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Cc: Alexey Dobriyan
    Cc: Davidlohr Bueso
    Cc: Dmitry Safonov
    Cc: Johannes Weiner
    Cc: Manfred Spraul
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

29 Jun, 2021

1 commit

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     

23 May, 2021

1 commit

  • do_mq_timedreceive calls wq_sleep with a stack local address. The
    sender (do_mq_timedsend) uses this address to later call pipelined_send.

    This leads to a very hard to trigger race where a do_mq_timedreceive
    call might return and leave do_mq_timedsend to rely on an invalid
    address, causing the following crash:

    RIP: 0010:wake_q_add_safe+0x13/0x60
    Call Trace:
    __x64_sys_mq_timedsend+0x2a9/0x490
    do_syscall_64+0x80/0x680
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f5928e40343

    The race occurs as:

    1. do_mq_timedreceive calls wq_sleep with the address of `struct
    ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
    holds a valid `struct ext_wait_queue *` as long as the stack has not
    been overwritten.

    2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
    do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
    __pipelined_op.

    3. Sender calls __pipelined_op::smp_store_release(&this->state,
    STATE_READY). Here is where the race window begins. (`this` is
    `ewq_addr`.)

    4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
    will see `state == STATE_READY` and break.

    5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
    to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
    stack. (Although the address may not get overwritten until another
    function happens to touch it, which means it can persist around for an
    indefinite time.)

    6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
    `struct ext_wait_queue *`, and uses it to find a task_struct to pass to
    the wake_q_add_safe call. In the lucky case where nothing has
    overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
    In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
    bogus address as the receiver's task_struct causing the crash.

    do_mq_timedsend::__pipelined_op() should not dereference `this` after
    setting STATE_READY, as the receiver counterpart is now free to return.
    Change __pipelined_op to call wake_q_add_safe on the receiver's
    task_struct returned by get_task_struct, instead of dereferencing `this`
    which sits on the receiver's stack.
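
    The ordering rule (copy what is needed out of the waiter before
    publishing STATE_READY, and never touch the waiter afterwards) can be
    sketched with C11 atomics. This is an illustration only: task_model,
    waiter_model, wake_and_put() and pipelined_op_model() are hypothetical
    names, and the actual wakeup is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { STATE_NONE, STATE_READY };

/* Hypothetical stand-in for a refcounted task. */
struct task_model {
    atomic_int refs;
    int wakeups;
};

/* Lives on the receiver's stack in the scenario described above. */
struct waiter_model {
    struct task_model *task;
    atomic_int state;
};

/* Wake the task and drop the reference that was taken for the wakeup. */
static void wake_and_put(struct task_model *t)
{
    t->wakeups++;
    atomic_fetch_sub(&t->refs, 1);
}

static void pipelined_op_model(struct waiter_model *w)
{
    /* Read the task pointer and pin the task while 'w' is still valid. */
    struct task_model *task = w->task;

    atomic_fetch_add(&task->refs, 1);

    /* After this store the receiver may return and reuse its stack. */
    atomic_store_explicit(&w->state, STATE_READY, memory_order_release);

    /* Only the local copy is used from here on; 'w' is not dereferenced. */
    wake_and_put(task);
}
```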

    As Manfred pointed out, the race potentially also exists in
    ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
    those in the same way.

    Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
    Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
    Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
    Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
    Signed-off-by: Varad Gautam
    Reported-by: Matthias von Faber
    Acked-by: Davidlohr Bueso
    Acked-by: Manfred Spraul
    Cc: Christian Brauner
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Varad Gautam
     

07 May, 2021

2 commits

  • s/purpuse/purpose/

    Link: https://lkml.kernel.org/r/20210319221432.26631-1-unixbhaskar@gmail.com
    Signed-off-by: Bhaskar Chowdhury
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaskar Chowdhury
     
  • s/runtine/runtime/
    s/AQUIRE/ACQUIRE/
    s/seperately/separately/
    s/wont/won\'t/
    s/succesfull/successful/

    Link: https://lkml.kernel.org/r/20210326022240.26375-1-unixbhaskar@gmail.com
    Signed-off-by: Bhaskar Chowdhury
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaskar Chowdhury
     

01 May, 2021

2 commits

  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Changelog

    v11:
    * Fix issue found by lkp robot.

    v8:
    * Fix issues found by lkp-tests project.

    v7:
    * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

    v6:
    * Fix bug in hugetlb_file_setup() detected by trinity.

    Reported-by: kernel test robot
    Reported-by: kernel test robot
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

24 Jan, 2021

3 commits

  • Extend some inode methods with an additional user namespace argument. A
    filesystem that is aware of idmapped mounts will receive the user
    namespace the mount has been marked with. This can be used for
    additional permission checking and also to enable filesystems to
    translate between uids and gids if they need to. We have implemented all
    relevant helpers in earlier patches.

    As requested we simply extend the existing inode methods instead of
    introducing new ones. This is a little more code churn but it's mostly
    mechanical and doesn't leave us with additional inode methods.

    Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The various vfs_*() helpers are called by filesystems or by the vfs
    itself to perform core operations such as create, link, mkdir, mknod, rename,
    rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
    inode is accessed through an idmapped mount, map it into the
    mount's user namespace and pass it down. Afterwards the checks and
    operations are identical to non-idmapped mounts. If the initial user
    namespace is passed nothing changes so non-idmapped mounts will see
    identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • The two helpers inode_permission() and generic_permission() are used by
    the vfs to perform basic permission checking by verifying that the
    caller is privileged over an inode. In order to handle idmapped mounts
    we extend the two helpers with an additional user namespace argument.
    On idmapped mounts the two helpers will make sure to map the inode
    according to the mount's user namespace and then perform identical
    permission checks to inode_permission() and generic_permission(). If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Christian Brauner

    Christian Brauner
     

16 Dec, 2020

2 commits

  • Merge misc updates from Andrew Morton:

    - a few random little subsystems

    - almost all of the MM patches which are staged ahead of linux-next
    material. I'll trickle the post-linux-next work in as the dependents
    get merged up.

    Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
    ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
    gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
    kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
    oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
    uaccess, zram, and cleanups).

    * emailed patches from Andrew Morton: (200 commits)
    mm: cleanup kstrto*() usage
    mm: fix fall-through warnings for Clang
    mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
    mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
    mm:backing-dev: use sysfs_emit in macro defining functions
    mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
    mm: use sysfs_emit for struct kobject * uses
    mm: fix kernel-doc markups
    zram: break the strict dependency from lzo
    zram: add stat to gather incompressible pages since zram set up
    zram: support page writeback
    mm/process_vm_access: remove redundant initialization of iov_r
    mm/zsmalloc.c: rework the list_add code in insert_zspage()
    mm/zswap: move to use crypto_acomp API for hardware acceleration
    mm/zswap: fix passing zero to 'PTR_ERR' warning
    mm/zswap: make struct kernel_param_ops definitions const
    userfaultfd/selftests: hint the test runner on required privilege
    userfaultfd/selftests: fix retval check for userfaultfd_open()
    userfaultfd/selftests: always dump something in modes
    userfaultfd: selftests: make __{s,u}64 format specifiers portable
    ...

    Linus Torvalds
     
  • Rename the callback to reflect that it's not called *on* or *after* split,
    but rather some time before the splitting to check if it's possible.

    Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com
    Signed-off-by: Dmitry Safonov
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Brian Geffon
    Cc: Catalin Marinas
    Cc: Dan Carpenter
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Russell King
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vishal Verma
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

15 Dec, 2020

1 commit

  • Pull misc fixes from Christian Brauner:
    "This contains several fixes which felt worth being combined into a
    single branch:

    - Use put_nsproxy() instead of open-coding it in switch_task_namespaces()

    - Kirill's work to unify lifecycle management for all namespaces. The
    lifetime counters are used identically for all namespace types.
    Namespaces may of course have additional unrelated counters and
    these are not altered. This work allows us to unify the type of the
    counters and reduces maintenance cost by moving the counter in one
    place and indicating that basic lifetime management is identical
    for all namespaces.

    - Peilin's fix adding three byte padding to Dmitry's
    PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

    - Two small patches to convert from the /* fall through */ comment
    annotation to the fallthrough keyword annotation which I had taken
    into my branch and into -next before df561f6688fe ("treewide: Use
    fallthrough pseudo-keyword") made it upstream which fixed this
    tree-wide.

    Since I didn't want to invalidate all testing for other commits I
    didn't rebase and kept them"

    * tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    nsproxy: use put_nsproxy() in switch_task_namespaces()
    sys: Convert to the new fallthrough notation
    signal: Convert to the new fallthrough notation
    time: Use generic ns_common::count
    cgroup: Use generic ns_common::count
    mnt: Use generic ns_common::count
    user: Use generic ns_common::count
    pid: Use generic ns_common::count
    ipc: Use generic ns_common::count
    uts: Use generic ns_common::count
    net: Use generic ns_common::count
    ns: Add a common refcount into ns_common
    ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()

    Linus Torvalds
     

06 Sep, 2020

1 commit

  • Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    changed ctl_table.proc_handler to take a kernel pointer. Adjust the
    signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which
    fixes the following sparse error/warning:

    ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces)
    ipc/ipc_sysctl.c:94:47: expected void *buffer
    ipc/ipc_sysctl.c:94:47: got void [noderef] __user *buffer
    ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
    ipc/ipc_sysctl.c:194:35: expected int ( [usertype] *proc_handler )( ... )
    ipc/ipc_sysctl.c:194:35: got int ( * )( ... )

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Link: https://lkml.kernel.org/r/20200825105846.5193-1-tklauser@distanz.ch
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where they occur.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
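
As a rough illustration of the conversion described above, the sketch below defines a `fallthrough` pseudo-keyword in user space, modelled on the kernel's approach of wrapping the compiler's fallthrough attribute. The helper `steps_from()` and its stage numbering are hypothetical and exist only to show the annotation in a switch statement.

```c
/* Provide __has_attribute on compilers that lack it. */
#ifndef __has_attribute
# define __has_attribute(x) 0
#endif

/* Pseudo-keyword: expands to the statement attribute when available,
 * otherwise to an empty statement. */
#if __has_attribute(__fallthrough__)
# define fallthrough __attribute__((__fallthrough__))
#else
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Hypothetical helper: count how many stages still run from `stage`. */
static int steps_from(int stage)
{
	int steps = 0;

	switch (stage) {
	case 0:
		steps++;
		fallthrough;	/* was: fall through comment */
	case 1:
		steps++;
		fallthrough;
	case 2:
		steps++;
		break;
	default:
		break;
	}
	return steps;
}
```

Unlike a comment, the annotation is visible to the compiler, so an unannotated fall-through can be flagged with -Wimplicit-fallthrough.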
     

19 Aug, 2020

1 commit

  • Switch over ipc namespaces to use the newly introduced common lifetime
    counter.

    Currently every namespace type has its own lifetime counter which is stored
    in the specific namespace struct. The lifetime counters are used
    identically for all namespace types. Namespaces may of course have
    additional unrelated counters and these are not altered.

    This introduces a common lifetime counter into struct ns_common. The
    ns_common struct encompasses information that all namespaces share. That
    should include the lifetime counter since it's common for all of them.

    It also allows us to unify the type of the counters across all namespaces.
    Most of them use refcount_t but one uses atomic_t and at least one uses
    kref. Especially the last one doesn't make much sense since it's just a
    wrapper around refcount_t since 2016 and actually complicates cleanup
    operations by having to use container_of() to cast the correct namespace
    struct out of struct ns_common.

    Having the lifetime counter for the namespaces in one place reduces
    maintenance cost. Not just because after switching all namespaces over we
    will have removed more code than we added but also because the logic is
    more easily understandable and we indicate to the user that the basic
    lifetime requirements for all namespaces are currently identical.

    Signed-off-by: Kirill Tkhai
    Reviewed-by: Kees Cook
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/159644978697.604812.16592754423881032385.stgit@localhost.localdomain
    Signed-off-by: Christian Brauner

    Kirill Tkhai
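
The idea of a lifetime counter living in the shared struct can be sketched in user space as follows. This is only a simplified model: the structs are cut-down stand-ins for the kernel's struct ns_common and struct ipc_namespace, a C11 atomic_int replaces refcount_t, the `freed` flag replaces real cleanup, and `demo_ipc_ns_lifetime()` is a hypothetical driver.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for struct ns_common: state shared by every namespace type. */
struct ns_common {
	atomic_int count;	/* common lifetime counter */
};

/* Stand-in for struct ipc_namespace: embeds the common part. */
struct ipc_namespace {
	struct ns_common ns;
	bool freed;		/* sketch only: marks that cleanup ran */
};

static void init_ipc_ns(struct ipc_namespace *ns)
{
	atomic_init(&ns->ns.count, 1);
	ns->freed = false;
}

static struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
{
	atomic_fetch_add(&ns->ns.count, 1);
	return ns;
}

static void put_ipc_ns(struct ipc_namespace *ns)
{
	/* atomic_fetch_sub returns the old value: 1 means last reference. */
	if (atomic_fetch_sub(&ns->ns.count, 1) == 1)
		ns->freed = true;
}

/* Hypothetical driver: returns 1 if cleanup ran only on the final put. */
static int demo_ipc_ns_lifetime(void)
{
	struct ipc_namespace ns;
	bool freed_early;

	init_ipc_ns(&ns);	/* count == 1 */
	get_ipc_ns(&ns);	/* count == 2 */
	put_ipc_ns(&ns);	/* count == 1 */
	freed_early = ns.freed;
	put_ipc_ns(&ns);	/* count == 0, cleanup runs */
	return !freed_early && ns.freed;
}
```

Because the counter is reached as `ns->ns.count` from the specific namespace struct, no container_of() step is needed on the put path.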
     

13 Aug, 2020

2 commits

  • Remove the superfluous break, as there is a 'return' before it.

    Signed-off-by: Liao Pingfang
    Signed-off-by: Yi Wang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1594724361-11525-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Linus Torvalds

    Liao Pingfang
     
  • Two functions are only called via function pointers, don't bother
    inlining them.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20200710200312.GA960353@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

08 Aug, 2020

1 commit

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     

10 Jun, 2020

1 commit

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
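
The wrapper pattern that the coccinelle rule introduces can be sketched as below. The rw_semaphore here is a toy, single-threaded stand-in (an assumption made so the example is self-contained), only the trylock and unlock variants are modelled, and `demo_mmap_lock()` is a hypothetical driver. The point is that callers pass `mm` and no longer touch `mm->mmap_sem` directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's rw_semaphore; not thread-safe. */
struct rw_semaphore {
	int readers;
	bool writer;
};

static void init_rwsem(struct rw_semaphore *s)
{
	s->readers = 0;
	s->writer = false;
}

static bool down_read_trylock(struct rw_semaphore *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

static void up_read(struct rw_semaphore *s)
{
	assert(s->readers > 0);
	s->readers--;
}

static bool down_write_trylock(struct rw_semaphore *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

static void up_write(struct rw_semaphore *s)
{
	assert(s->writer);
	s->writer = false;
}

struct mm_struct {
	struct rw_semaphore mmap_sem;
};

/* New-style API: the lock field is hidden behind helpers taking mm. */
static inline void mmap_init_lock(struct mm_struct *mm) { init_rwsem(&mm->mmap_sem); }
static inline bool mmap_read_trylock(struct mm_struct *mm) { return down_read_trylock(&mm->mmap_sem); }
static inline void mmap_read_unlock(struct mm_struct *mm) { up_read(&mm->mmap_sem); }
static inline bool mmap_write_trylock(struct mm_struct *mm) { return down_write_trylock(&mm->mmap_sem); }
static inline void mmap_write_unlock(struct mm_struct *mm) { up_write(&mm->mmap_sem); }

/* Hypothetical driver: returns the number of checks that behaved as expected. */
static int demo_mmap_lock(void)
{
	struct mm_struct mm;
	int ok = 0;

	mmap_init_lock(&mm);
	ok += mmap_read_trylock(&mm);	/* read lock taken */
	ok += !mmap_write_trylock(&mm);	/* writer excluded while read-held */
	mmap_read_unlock(&mm);
	ok += mmap_write_trylock(&mm);	/* write lock taken */
	ok += !mmap_read_trylock(&mm);	/* reader excluded while write-held */
	mmap_write_unlock(&mm);
	return ok;
}
```

Hiding the field behind helpers means a later change to the lock's name or type only has to touch the wrappers.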
     

09 Jun, 2020

2 commits

  • The reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.

    The code:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <error.h>
    #include <errno.h>
    #include <stdlib.h>

    int main()
    {
            int i;

            for (i = 0; i < 1000; i++)
                    if (unshare(CLONE_NEWIPC) < 0)
                            error(EXIT_FAILURE, errno, "unshare");
    }

    goes from

    Command being timed: "./ipc-namespace"
    User time (seconds): 0.00
    System time (seconds): 0.06
    Percent of CPU this job got: 0%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05

    to

    Command being timed: "./ipc-namespace"
    User time (seconds): 0.00
    System time (seconds): 0.02
    Percent of CPU this job got: 96%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03

    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Andrew Morton
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
    Signed-off-by: Linus Torvalds

    Giuseppe Scrivano
     
  • Sparse reports a warning at freeque()

    warning: context imbalance in freeque() - unexpected unlock

    The root cause is the missing annotation at freeque()

    Add the missing __releases(RCU) annotation
    Add the missing __releases(&msq->q_perm) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Boqun Feng
    Cc: Lu Shuaibing
    Cc: Nathan Chancellor
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20200403160505.2832-2-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     

04 Jun, 2020

2 commits

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     
  • Pull thread updates from Christian Brauner:
    "We have been discussing using pidfds to attach to namespaces for quite
    a while and the patches have in one form or another already existed
    for about a year. But I wanted to wait to see how the general api
    would be received and adopted.

    This contains the changes to make it possible to use pidfds to attach
    to the namespaces of a process, i.e. they can be passed as the first
    argument to the setns() syscall.

    When only a single namespace type is specified the semantics are
    equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
    equals setns(pidfd, CLONE_NEWNET).

    However, when a pidfd is passed, multiple namespace flags can be
    specified in the second setns() argument and setns() will attach the
    caller to all the specified namespaces all at once or to none of them.

    Specifying 0 is not valid together with a pidfd. Here are just two
    obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Allowing to also attach subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain
    privilege, perform an action and then re-attach another subset of
    namespaces.

    Apart from significantly reducing the number of syscalls needed to
    attach to all currently supported namespaces (eight "open+setns"
    sequences vs just a single "setns()"), this also allows atomic setns
    to a set of namespaces, i.e. either attaching to all namespaces
    succeeds or we fail without having changed anything.

    This is centered around a new internal struct nsset which holds all
    information necessary for a task to switch to a new set of namespaces
    atomically. Fwiw, with this change a pidfd becomes the only token
    needed to interact with a container. I'm expecting this to be
    picked-up by util-linux for nsenter rather soon.

    Associated with this change is a shiny new test-suite dedicated to
    setns() (for pidfds and nsfds alike)"

    * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests/pidfd: add pidfd setns tests
    nsproxy: attach to namespaces via pidfds
    nsproxy: add struct nsset

    Linus Torvalds
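
A minimal user-space sketch of the pidfd-based setns() usage described above. The helper name `attach_to_namespaces()` is hypothetical. It assumes a kernel that accepts pidfds in setns() (the pull request is tagged threads-v5.8) and a caller with sufficient privileges over the target's namespaces. pidfd_open() is invoked through syscall() in case the C library has no wrapper, and the fallback syscall number is the one used on x86-64 and most other architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: number on x86-64 and most arches */
#endif

/*
 * Hypothetical helper: attach the caller to the namespaces of `pid`
 * selected by `nsflags` (a mask of CLONE_NEW* flags). Either all of the
 * requested namespaces are switched or none are.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int attach_to_namespaces(pid_t pid, int nsflags)
{
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	int ret, saved_errno;

	if (pidfd < 0)
		return -1;

	ret = setns(pidfd, nsflags);

	/* Preserve the setns() error across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

A caller would pass, for example, `CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET` as `nsflags`. Passing 0 fails, since 0 is not valid together with a pidfd.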
     

16 May, 2020

1 commit