08 Apr, 2020

1 commit

  • Commit 769071ac9f20 "ns: Introduce Time Namespace" broke reporting of
    inotify ucounts (max_inotify_instances, max_inotify_watches) in
    /proc/sys/user because it added UCOUNT_TIME_NAMESPACES to enum
    ucount_type but didn't update the reporting in
    kernel/ucount.c:setup_userns_sysctls() accordingly. This problem was
    fixed in commit eeec26d5da82 "time/namespace: Add max_time_namespaces
    ucount".

    Add BUILD_BUG_ON to catch a similar problem in the future.

    Signed-off-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Acked-by: Andrei Vagin
    Link: https://lkml.kernel.org/r/20200407154643.10102-1-jack@suse.cz

    Jan Kara
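
To illustrate the kind of build-time check described above, here is a minimal userspace sketch. It uses C11 static_assert as a stand-in for the kernel's BUILD_BUG_ON, and the enum, table, and helper names (count_type, count_table, count_name) are hypothetical, not the identifiers in kernel/ucount.c.

```c
#include <assert.h>   /* static_assert (C11) and assert */
#include <stddef.h>

/* Hypothetical counter types; COUNT_TYPES is the number of real entries. */
enum count_type {
    COUNT_USER_NAMESPACES,
    COUNT_TIME_NAMESPACES,
    COUNT_INOTIFY_INSTANCES,
    COUNT_INOTIFY_WATCHES,
    COUNT_TYPES
};

struct count_entry {
    const char *name;
};

/* One named entry per enum member, followed by an empty terminator. */
static const struct count_entry count_table[] = {
    { "max_user_namespaces" },
    { "max_time_namespaces" },
    { "max_inotify_instances" },
    { "max_inotify_watches" },
    { NULL }
};

#define TABLE_LEN (sizeof(count_table) / sizeof(count_table[0]))

/* The build fails if an enum member is added without a table entry. */
static_assert(TABLE_LEN == COUNT_TYPES + 1,
              "count_table[] is out of sync with enum count_type");

/* Look up the reporting name for a counter type (or the terminator). */
static const char *count_name(enum count_type type)
{
    assert(type <= COUNT_TYPES);
    return count_table[type].name;
}
```

With the check in place, forgetting a table entry becomes a compile error rather than a silently wrong name at run time.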
     

07 Apr, 2020

1 commit

  • Michael noticed that the userns limit for the number of time namespaces
    is missing.

    Furthermore, the time namespace code introduced UCOUNT_TIME_NAMESPACES,
    but didn't introduce a matching array member in user_table[]. That would
    make the array's initialisation an OOB write, but by luck the user_table
    array has a spare empty member (all accesses to the array are limited by
    UCOUNT_COUNTS), so it silently reuses the last free member.

    Fixes a user-visible regression: because of the missing UCOUNT_ENTRY(),
    max_inotify_instances has limited the max number of namespaces instead
    of the number of inotify instances.

    Fixes: 769071ac9f20 ("ns: Introduce Time Namespace")
    Reported-by: Michael Kerrisk (man-pages)
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Thomas Gleixner
    Acked-by: Andrei Vagin
    Acked-by: Vincenzo Frascino
    Cc: stable@kernel.org
    Link: https://lkml.kernel.org/r/20200406171342.128733-1-dima@arista.com

    Dmitry Safonov
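
The regression described above can be reproduced in miniature. The sketch below assumes a positional name table that is matched by index against an enum-indexed array of limits. All identifiers (ctype, names_buggy, names_fixed, set_limit_by_name) are hypothetical and only mimic the shape of the problem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counter types, in the order the limits array is indexed. */
enum ctype { C_NET_NS, C_TIME_NS, C_INOTIFY_INSTANCES, C_INOTIFY_WATCHES, C_TYPES };

/* Positional table with the time namespace entry forgotten. The spare
 * slots are NULL, so nothing is written out of bounds, but every later
 * name now sits one index too low. */
static const char *const names_buggy[C_TYPES + 1] = {
    "max_net_namespaces",
    "max_inotify_instances",
    "max_inotify_watches",
};

/* Table with one entry per enum member. */
static const char *const names_fixed[C_TYPES + 1] = {
    "max_net_namespaces",
    "max_time_namespaces",
    "max_inotify_instances",
    "max_inotify_watches",
};

static long limits[C_TYPES];

/* Set the limit whose table position carries the given name.
 * Returns the index that was written, or -1 if the name is unknown. */
static int set_limit_by_name(const char *const *names, const char *name, long value)
{
    assert(names != NULL && name != NULL);
    for (int i = 0; i < C_TYPES; i++) {
        if (names[i] && strcmp(names[i], name) == 0) {
            limits[i] = value;
            return i;
        }
    }
    return -1;
}
```

With names_buggy, writing "max_inotify_instances" lands on index C_TIME_NS, so the value limits the wrong counter; with names_fixed it lands on C_INOTIFY_INSTANCES as intended.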
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    the minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file has some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so that the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                          old     new   delta
    sysctl_vals                     -      12     +12
    __kstrtab_sysctl_vals           -      12     +12
    max                            14      10      -4
    int_max                        16       -     -16
    one                            68       -     -68
    zero                          128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
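
As a rough userspace sketch of the idea above: one shared constant array, macros pointing at its members, and a range check that reads its bounds through extra1/extra2. The names range_vals, RANGE_*, struct knob, and knob_set are hypothetical stand-ins, not the kernel's sysctl_vals, struct ctl_table, or proc_dointvec_minmax().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Shared boundary values, defined once instead of once per source file. */
static const int range_vals[] = { 0, 1, INT_MAX };

#define RANGE_ZERO    ((const void *)&range_vals[0])
#define RANGE_ONE     ((const void *)&range_vals[1])
#define RANGE_INT_MAX ((const void *)&range_vals[2])

/* Hypothetical stand-in for a sysctl table entry. */
struct knob {
    int *data;          /* where the accepted value is stored */
    const void *extra1; /* pointer to the minimum, or NULL for no minimum */
    const void *extra2; /* pointer to the maximum, or NULL for no maximum */
};

/* Store value if it lies within [*extra1, *extra2]; return 0, else -1. */
static int knob_set(const struct knob *k, int value)
{
    assert(k != NULL && k->data != NULL);
    if (k->extra1 && value < *(const int *)k->extra1)
        return -1;
    if (k->extra2 && value > *(const int *)k->extra2)
        return -1;
    *k->data = value;
    return 0;
}

/* Two example knobs that share the same boundary constants. */
static int bool_value;
static int size_value;
static const struct knob bool_knob = { &bool_value, RANGE_ZERO, RANGE_ONE };
static const struct knob size_knob = { &size_value, RANGE_ONE, RANGE_INT_MAX };
```

Both knobs refer to the same three integers, so no per-file zero, one or int_max variable is needed.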
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Apr, 2018

1 commit

  • Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
    reason. It looks like it's only a convenience, so remove kmemleak.h
    from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
    don't already #include it. Also remove <linux/kmemleak.h> from source
    files that do not use it.

    This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
    be good to run it through the 0day bot for other $ARCHes. I have
    neither the horsepower nor the storage space for the other $ARCHes.

    Update: This patch has been extensively build-tested by both the 0day
    bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
    for which patches are included here (in v2).

    [ slab.h is the second most used header file after module.h; kernel.h is
    right there with slab.h. There could be some minor error in the
    counting due to some #includes having comments after them and I didn't
    combine all of those. ]

    [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
    Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
    Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
    Signed-off-by: Randy Dunlap
    Reviewed-by: Ingo Molnar
    Reported-by: Michael Ellerman [2 build failures]
    Reported-by: Fengguang Wu [2 build failures]
    Reviewed-by: Andrew Morton
    Cc: Wei Yongjun
    Cc: Luis R. Rodriguez
    Cc: Greg Kroah-Hartman
    Cc: Mimi Zohar
    Cc: John Johansen
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

07 Mar, 2017

1 commit

  • Always increment/decrement ucount->count under the ucounts_lock. The
    increments are there already and moving the decrements there means the
    locking logic of the code is simpler. This simplification in the
    locking logic fixes a race between put_ucounts and get_ucounts that
    could result in a use-after-free because the count could go zero then
    be found by get_ucounts and then be freed by put_ucounts.

    A bug presumably this one was found by a combination of syzkaller and
    KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
    spotted the race in the code.

    Cc: stable@vger.kernel.org
    Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user")
    Reported-by: JongHwan Kim
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
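
The locking rule above can be sketched in userspace as follows. This is an illustration only: it uses a pthread mutex in place of the kernel spinlock, and entry_get/entry_put with a simple linked list are hypothetical stand-ins for get_ucounts/put_ucounts and their hash table.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct entry {
    int key;
    int count;           /* only ever changed with table_lock held */
    struct entry *next;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *table_head;

/* Find or create the entry for key and take a reference. The lookup and
 * the increment happen under the same lock. */
static struct entry *entry_get(int key)
{
    struct entry *e;

    pthread_mutex_lock(&table_lock);
    for (e = table_head; e; e = e->next)
        if (e->key == key)
            break;
    if (!e) {
        e = calloc(1, sizeof(*e));
        if (e) {
            e->key = key;
            e->next = table_head;
            table_head = e;
        }
    }
    if (e)
        e->count++;
    pthread_mutex_unlock(&table_lock);
    return e;
}

/* Drop a reference. The decrement and the unlink happen under the lock,
 * so entry_get() can never find an entry whose count already hit zero. */
static void entry_put(struct entry *e)
{
    struct entry **pp;
    int last;

    pthread_mutex_lock(&table_lock);
    assert(e->count > 0);
    last = (--e->count == 0);
    if (last) {
        for (pp = &table_head; *pp != e; pp = &(*pp)->next)
            ;
        *pp = e->next;
    }
    pthread_mutex_unlock(&table_lock);

    if (last)
        free(e); /* no longer reachable, so freeing outside the lock is safe */
}
```

If the decrement were done outside the lock, another thread could look up the entry between the count reaching zero and the unlink, and end up holding a pointer to freed memory.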
     

02 Mar, 2017

1 commit


24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unprivileged mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns out we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl,
    systemd, forks all children after the prctl. So no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpointing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to manipulate
    namespaces with two new trivial ioctls to allow discovery of the
    hierarchy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that, when a
    network namespace exits, purges its sysctl entries from the dcache, as
    in some circumstances this could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec, allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in its own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm, allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount. As the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace make it possible
    to discuss how to fix the worst case complexity of umount. The stacked
    filesystem fixes pave the way for adding multiple mappings for the
    filesystem uids so that efficient and safer containers can be
    implemented"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

09 Feb, 2017

1 commit

  • The user_header gets caught by kmemleak with the following splat as
    missing a free:

    unreferenced object 0xffff99667a733d80 (size 96):
    comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s)
    hex dump (first 32 bytes):
    a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00 ................
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmemleak_alloc+0x4a/0xa0
    __kmalloc+0x144/0x260
    __register_sysctl_table+0x54/0x5e0
    register_sysctl+0x1b/0x20
    user_namespace_sysctl_init+0x17/0x34
    do_one_initcall+0x52/0x1a0
    kernel_init_freeable+0x173/0x200
    kernel_init+0xe/0x100
    ret_from_fork+0x2c/0x40

    The BUG_ON()s are intended to crash, so there is no need to clean up
    after ourselves on error there. This is also a kernel/ subsys_init(),
    so we don't need a respective exit call here as this is never modular;
    just whitelist it.

    Link: http://lkml.kernel.org/r/20170203211404.31458-1-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Nikolay Borisov
    Cc: Serge Hallyn
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

24 Jan, 2017

2 commits

  • This patchset converts inotify to using the newly introduced
    per-userns sysctl infrastructure.

    Currently the inotify instances/watches are being accounted in the
    user_struct structure. This means that in setups where multiple
    users in unprivileged containers map to the same underlying
    real user (i.e. pointing to the same user_struct) the inotify limits
    are going to be shared as well, allowing one user (or application) to
    exhaust all the others' limits.

    Fix this by switching the inotify sysctls to using the
    per-namespace/per-user limits. This will allow the server admin to
    set sensible global limits, which can further be tuned inside every
    individual user namespace. Additionally, in order to preserve the
    sysctl ABI make the existing inotify instances/watches sysctls
    modify the values of the initial user namespace.

    Signed-off-by: Nikolay Borisov
    Acked-by: Jan Kara
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Nikolay Borisov
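
A small sketch of how a global limit and per-namespace limits can combine. It assumes that each use is charged against the namespace and every one of its ancestors, and that the charge is rolled back if any level is full. The type and function names (ns_counter, ns_charge, ns_uncharge) and the example hierarchy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct ns_counter {
    struct ns_counter *parent; /* NULL for the initial namespace */
    long count;                /* current usage at this level */
    long max;                  /* limit configured at this level */
};

/* Charge one unit at this level and at every ancestor. If any level is
 * already at its limit, undo the partial charge and return -1. */
static int ns_charge(struct ns_counter *ns)
{
    struct ns_counter *iter, *undo;

    assert(ns != NULL);
    for (iter = ns; iter; iter = iter->parent) {
        if (iter->count >= iter->max)
            goto fail;
        iter->count++;
    }
    return 0;

fail:
    for (undo = ns; undo != iter; undo = undo->parent)
        undo->count--;
    return -1;
}

/* Release one unit at this level and at every ancestor. */
static void ns_uncharge(struct ns_counter *ns)
{
    for (; ns; ns = ns->parent) {
        assert(ns->count > 0);
        ns->count--;
    }
}

/* Example hierarchy: a global limit of 3 and two containers limited to 2. */
static struct ns_counter root_ns = { NULL, 0, 3 };
static struct ns_counter child_a = { &root_ns, 0, 2 };
static struct ns_counter child_b = { &root_ns, 0, 2 };
```

In this arrangement child_a can never use more than its own two units, however the users inside it are mapped, while the root limit still caps the total across both containers.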
     
  • The ucounts_lock is being used to protect various ucounts lifecycle
    management functionalities. However, those services can also be invoked
    when a pidns is being freed in an RCU callback (e.g. softirq context).
    This can lead to deadlocks. There were already efforts trying to
    prevent similar deadlocks in add7c65ca426 ("pid: fix lockdep deadlock
    warning due to ucount_lock"), however they just moved the context
    from hardirq to softirq. Fix this issue once and for all by explicitly
    making the lock disable irqs altogether.

    Dmitry Vyukov reported:

    > I've got the following deadlock report while running syzkaller fuzzer
    > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
    > device if it matters):
    >
    > =================================
    > [ INFO: inconsistent lock state ]
    > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
    > ---------------------------------
    > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    > (ucounts_lock){+.?...}, at: [< inline >] spin_lock
    > ./include/linux/spinlock.h:302
    > (ucounts_lock){+.?...}, at: []
    > put_ucounts+0x60/0x138 kernel/ucount.c:162
    > {SOFTIRQ-ON-W} state was registered at:
    > [] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
    > [< inline >] mark_irqflags kernel/locking/lockdep.c:2941
    > [] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
    > [] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
    > [< inline >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
    > [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
    > [< inline >] spin_lock ./include/linux/spinlock.h:302
    > [< inline >] get_ucounts kernel/ucount.c:131
    > [] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
    > [< inline >] inc_mnt_namespaces fs/namespace.c:2818
    > [] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
    > [] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
    > [< inline >] init_mount_tree fs/namespace.c:3199
    > [] mnt_init+0x258/0x384 fs/namespace.c:3251
    > [] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
    > [] start_kernel+0x414/0x460 init/main.c:648
    > [] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
    > irq event stamp: 2316924
    > hardirqs last enabled at (2316924): [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2911
    > hardirqs last enabled at (2316924): [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > hardirqs last enabled at (2316924): [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > hardirqs last enabled at (2316924): []
    > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
    > hardirqs last disabled at (2316923): [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2900
    > hardirqs last disabled at (2316923): [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > hardirqs last disabled at (2316923): [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > hardirqs last disabled at (2316923): []
    > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
    > softirqs last enabled at (2316912): []
    > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
    > softirqs last disabled at (2316913): [< inline >]
    > do_softirq_own_stack ./include/linux/interrupt.h:488
    > softirqs last disabled at (2316913): [< inline >]
    > invoke_softirq kernel/softirq.c:371
    > softirqs last disabled at (2316913): []
    > irq_exit+0x264/0x308 kernel/softirq.c:405
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(ucounts_lock);
    >
    > lock(ucounts_lock);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by swapper/2/0:
    > #0: (rcu_callback){......}, at: [< inline >] __rcu_reclaim
    > kernel/rcu/rcu.h:108
    > #0: (rcu_callback){......}, at: [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2919
    > #0: (rcu_callback){......}, at: [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > #0: (rcu_callback){......}, at: [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > #0: (rcu_callback){......}, at: []
    > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
    >
    > stack backtrace:
    > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
    > Hardware name: Hardkernel ODROID-C2 (DT)
    > Call trace:
    > [] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
    > [] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
    > [] dump_stack+0x110/0x168
    > [] print_usage_bug.part.27+0x49c/0x4bc
    > kernel/locking/lockdep.c:2387
    > [< inline >] print_usage_bug kernel/locking/lockdep.c:2357
    > [< inline >] valid_state kernel/locking/lockdep.c:2400
    > [< inline >] mark_lock_irq kernel/locking/lockdep.c:2617
    > [] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
    > [< inline >] mark_irqflags kernel/locking/lockdep.c:2923
    > [] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
    > [] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
    > [< inline >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
    > [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
    > [< inline >] spin_lock ./include/linux/spinlock.h:302
    > [] put_ucounts+0x60/0x138 kernel/ucount.c:162
    > [] dec_ucount+0xf4/0x158 kernel/ucount.c:214
    > [< inline >] dec_pid_namespaces kernel/pid_namespace.c:89
    > [] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
    > [< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
    > [< inline >] rcu_do_batch kernel/rcu/tree.c:2919
    > [< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > [< inline >] __rcu_process_callbacks kernel/rcu/tree.c:3149
    > [] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
    > [] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
    > [< inline >] do_softirq_own_stack ./include/linux/interrupt.h:488
    > [< inline >] invoke_softirq kernel/softirq.c:371
    > [] irq_exit+0x264/0x308 kernel/softirq.c:405
    > [] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
    > [] gic_handle_irq+0x68/0xd8
    > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
    > 7dc0: ffff8000648d4b3c 0000000000000007
    > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
    > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
    > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
    > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
    > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
    > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
    > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
    > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
    > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
    > [] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
    > [< inline >] arch_local_irq_restore
    > ./arch/arm64/include/asm/irqflags.h:81
    > [] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
    > [< inline >] cpuidle_idle_call kernel/sched/idle.c:200
    > [] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
    > [] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
    > [] secondary_start_kernel+0x2cc/0x358
    > arch/arm64/kernel/smp.c:276
    > [] 0x279f1a4

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Fixes: add7c65ca426 ("pid: fix lockdep deadlock warning due to ucount_lock")
    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Cc: stable@vger.kernel.org
    Link: https://www.spinics.net/lists/kernel/msg2426637.html
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Eric W. Biederman

    Nikolay Borisov
     

31 Aug, 2016

1 commit


09 Aug, 2016

9 commits