23 Feb, 2022

1 commit

  • commit 0cbae9e24fa7d6c6e9f828562f084da82217a0c5 upstream.

    While examining is_ucounts_overlimit and reading the various messages
    I realized that is_ucounts_overlimit fails to deal with counts that
    may have wrapped.

    Being wrapped should be a transitory state for counts and they should
    never be wrapped for long, but it can happen so handle it.
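
    A minimal user-space sketch of the idea, assuming the counter is a
    signed long and the limit an unsigned long. The helper name
    count_over_limit is hypothetical and this is not the kernel function;
    it only shows that a wrapped (negative) counter is reported as over
    the limit instead of being compared as-is:

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not kernel code): decide whether a signed counter
 * exceeds an unsigned limit. A negative value means the counter wrapped,
 * so it is reported as over the limit.
 */
static bool count_over_limit(long val, unsigned long rlimit)
{
	/* Clamp the limit into the range a signed counter can represent. */
	long max = rlimit > (unsigned long)LONG_MAX ? LONG_MAX : (long)rlimit;

	return val < 0 || val > max;
}
```

    Without the val < 0 test, a wrapped counter would compare as smaller
    than any positive limit and slip through the check.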

    Cc: stable@vger.kernel.org
    Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
    Link: https://lkml.kernel.org/r/20220216155832.680775-5-ebiederm@xmission.com
    Reviewed-by: Shuah Khan
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

02 Feb, 2022

1 commit

  • commit f9d87929d451d3e649699d0f1d74f71f77ad38f5 upstream.

    When the ucount code was refactored to create get_ucount it was missed
    that some of the contexts in which a rlimit is kept elevated can be
    the only reference to the user/ucount in the system.

    Ordinary ucount references exist in places that also have a reference
    to the user namespace, but in POSIX message queues, the SysV shm code,
    and the SIGPENDING code there is no independent user namespace
    reference.

    Inspection of the user_namespace shows no instance of circular
    references between struct ucounts and the user_namespace. So
    hold a reference from struct ucounts to its user_namespace to
    resolve this problem.

    Link: https://lore.kernel.org/lkml/YZV7Z+yXbsx9p3JN@fixkernel.com/
    Reported-by: Qian Cai
    Reported-by: Mathias Krause
    Tested-by: Mathias Krause
    Reviewed-by: Mathias Krause
    Reviewed-by: Alexey Gladkov
    Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
    Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

29 Dec, 2021

1 commit

  • [ Upstream commit 59ec71575ab440cd5ca0aa53b2a2985b3639fad4 ]

    The semantics of the rlimit max values differ from ucounts itself. When
    creating a new userns, we store the current rlimit of the process in
    ucount_max. Thus, the value of the limit in the parent userns is saved
    in the created one.

    The problem is that now we are taking the maximum value for counter from
    the same userns. So for init_user_ns it will always be RLIM_INFINITY.

    To fix the problem we need to check the counter value with the max value
    stored in userns.
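
    The walk described above can be sketched in user space as follows.
    This is a simplified model under stated assumptions, not the kernel
    code: struct ns_level and charge() are hypothetical names, the
    counters are plain (non-atomic) longs, and ns_max stands for the
    limit saved in a namespace when it was created. That saved limit is
    applied to the counter one level up, in the parent namespace:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of one level in the user namespace hierarchy. */
struct ns_level {
	long count;              /* usage charged at this level */
	long ns_max;             /* limit saved when this namespace was created */
	struct ns_level *parent; /* NULL for the initial namespace */
};

/*
 * Charge v at every level from leaf to root. Returns the new leaf count,
 * or LONG_MAX if any level went over its limit. The counters stay
 * incremented on failure, so the caller has to undo the charge.
 */
static long charge(struct ns_level *leaf, long v)
{
	struct ns_level *it;
	long max = LONG_MAX; /* the leaf is checked by the caller */
	long ret = 0;

	for (it = leaf; it; it = it->parent) {
		long new = it->count += v;

		if (new < 0 || new > max)
			ret = LONG_MAX;
		else if (it == leaf)
			ret = new;
		/* The parent's counter is limited by the value stored here. */
		max = it->ns_max;
	}
	return ret;
}
```

    With a child namespace created under "ulimit -u 3", ns_max is 3 in the
    child, so a charge that pushes the parent's counter to 4 is reported
    as a failure, which matches the "After" output of the reproducer.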

    Reproducer:

    su - test -c "ulimit -u 3; sleep 5 & sleep 6 & unshare -U --map-root-user sh -c 'sleep 7 & sleep 8 & date; wait'"

    Before:

    [1] 175
    [2] 176
    Fri Nov 26 13:48:20 UTC 2021
    [1]- Done sleep 5
    [2]+ Done sleep 6

    After:

    [1] 167
    [2] 168
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: retry: Resource temporarily unavailable
    sh: fork: Interrupted system call
    [1]- Done sleep 5
    [2]+ Done sleep 6

    Fixes: c54b245d0118 ("Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
    Reported-by: Gleb Fotengauer-Malinovskiy
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/024ec805f6e16896f0b23e094773790d171d2c1c.1638218242.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Alexey Gladkov
     

19 Oct, 2021

1 commit

  • In commit fda31c50292a ("signal: avoid double atomic counter
    increments for user accounting") Linus made a clever optimization to
    how rlimits and the struct user_struct are handled. Unfortunately that
    optimization does not work in the obvious way when moved to nested
    rlimits. The problem is that the last decrement of the per user
    namespace per user sigpending counter might also be the last decrement
    of the sigpending counter in the parent user namespace, which
    means that simply freeing the leaf ucount in __free_sigqueue is not
    enough.

    Maintain the optimization and handle the tricky cases by introducing
    inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.

    By moving the entire optimization into functions that perform all of
    the work it becomes possible to ensure that every level is handled
    properly.

    The new function inc_rlimit_get_ucounts returns 0 on failure to
    increment the ucount. This is different from inc_rlimit_ucounts, which
    increments the ucounts and returns LONG_MAX if the ucount counter has
    exceeded its maximum or has wrapped (to indicate the counter needs to
    be decremented).
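
    A rough user-space sketch of the "return 0 on failure" convention,
    under simplifying assumptions: struct level and try_charge_one() are
    hypothetical names, the counters are plain longs rather than atomics,
    each level carries its own limit, and the reference counting that the
    real functions also perform is left out. The point is only that a
    failed attempt is fully undone before 0 is returned, so the caller
    has nothing to clean up:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one level of nested counters. */
struct level {
	long count;
	long limit;           /* simplification: limit for this level's own count */
	struct level *parent; /* NULL at the top */
};

/*
 * Add one to every level from leaf to root. On success return the new
 * leaf count (always >= 1). On failure roll back every increment made so
 * far and return 0.
 */
static long try_charge_one(struct level *leaf)
{
	struct level *it, *undo;
	long ret = 0;

	assert(leaf != NULL);
	for (it = leaf; it; it = it->parent) {
		long new = ++it->count;

		if (new > it->limit)
			goto unwind;
		if (it == leaf)
			ret = new;
	}
	return ret;

unwind:
	/* Undo the levels already touched, including the failing one. */
	for (undo = leaf; undo != it->parent; undo = undo->parent)
		undo->count--;
	return 0;
}
```

    A caller can then write "if (!try_charge_one(leaf)) return -EAGAIN;"
    style code without a separate decrement on the error path.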

    I wish we had a single user to account all pending signals to across
    all of the threads of a process so this complexity was not necessary.

    Cc: stable@vger.kernel.org
    Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
    Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
    Reviewed-by: Alexey Gladkov
    Tested-by: Rune Kleveland
    Tested-by: Yu Zhao
    Tested-by: Jordan Glover
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Aug, 2021

1 commit

  • commit f9c82a4ea89c3 ("Increase size of ucounts to atomic_long_t")
    changed the data type of ucounts/ucounts_max to long, but failed to
    adjust a few other places. This is noticeable on big endian platforms
    from user space because the /proc/sys/user/max_*_namespaces files all
    contain 0.
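
    Why the files read as 0 only on big endian can be illustrated with a
    small sketch, assuming a 64-bit long and a 32-bit int. The helpers
    store_u64() and load_first_u32() are hypothetical and simulate the
    memory layout explicitly, so the example behaves the same on any host:

```c
#include <stdint.h>

enum byte_order { ORDER_LITTLE, ORDER_BIG };

/* Lay out a 64-bit value in memory the way the given byte order would. */
static void store_u64(unsigned char out[8], uint64_t v, enum byte_order order)
{
	for (int i = 0; i < 8; i++) {
		int shift = (order == ORDER_LITTLE) ? 8 * i : 8 * (7 - i);

		out[i] = (unsigned char)(v >> shift);
	}
}

/* Read only the first 4 bytes as a 32-bit value in the same byte order. */
static uint32_t load_first_u32(const unsigned char in[8], enum byte_order order)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++) {
		int shift = (order == ORDER_LITTLE) ? 8 * i : 8 * (3 - i);

		v |= (uint32_t)in[i] << shift;
	}
	return v;
}
```

    Reading a small long through an int-sized access returns the low half
    on little endian, so the value looks right, but the high half on big
    endian, which is 0.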

    v4 - Made the min and max constants long so the sysctl values
    are actually settable on little endian machines.
    -- EWB

    Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t")
    Signed-off-by: Sven Schnelle
    Tested-by: Nathan Chancellor
    Tested-by: Linux Kernel Functional Testing
    Acked-by: Alexey Gladkov
    v1: https://lkml.kernel.org/r/20210721115800.910778-1-svens@linux.ibm.com
    v2: https://lkml.kernel.org/r/20210721125233.1041429-1-svens@linux.ibm.com
    v3: https://lkml.kernel.org/r/20210730062854.3601635-1-svens@linux.ibm.com
    Link: https://lkml.kernel.org/r/8735rijqlv.fsf_-_@disp2133
    Signed-off-by: Eric W. Biederman

    Sven Schnelle
     

29 Jul, 2021

1 commit

  • The race happens because put_ucounts() doesn't use spinlock and
    get_ucounts is not under spinlock:

    CPU0                                        CPU1
    ----                                        ----
    alloc_ucounts()                             put_ucounts()

    spin_lock_irq(&ucounts_lock);
    ucounts = find_ucounts(ns, uid, hashent);

                                                atomic_dec_and_test(&ucounts->count))

    spin_unlock_irq(&ucounts_lock);

                                                spin_lock_irqsave(&ucounts_lock, flags);
                                                hlist_del_init(&ucounts->node);
                                                spin_unlock_irqrestore(&ucounts_lock, flags);
                                                kfree(ucounts);

    ucounts = get_ucounts(ucounts);

    ==================================================================
    BUG: KASAN: use-after-free in instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
    BUG: KASAN: use-after-free in atomic_add_negative include/asm-generic/atomic-instrumented.h:556 [inline]
    BUG: KASAN: use-after-free in get_ucounts kernel/ucount.c:152 [inline]
    BUG: KASAN: use-after-free in get_ucounts kernel/ucount.c:150 [inline]
    BUG: KASAN: use-after-free in alloc_ucounts+0x19b/0x5b0 kernel/ucount.c:188
    Write of size 4 at addr ffff88802821e41c by task syz-executor.4/16785

    CPU: 1 PID: 16785 Comm: syz-executor.4 Not tainted 5.14.0-rc1-next-20210712-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:105
    print_address_description.constprop.0.cold+0x6c/0x309 mm/kasan/report.c:233
    __kasan_report mm/kasan/report.c:419 [inline]
    kasan_report.cold+0x83/0xdf mm/kasan/report.c:436
    check_region_inline mm/kasan/generic.c:183 [inline]
    kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
    instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
    atomic_add_negative include/asm-generic/atomic-instrumented.h:556 [inline]
    get_ucounts kernel/ucount.c:152 [inline]
    get_ucounts kernel/ucount.c:150 [inline]
    alloc_ucounts+0x19b/0x5b0 kernel/ucount.c:188
    set_cred_ucounts+0x171/0x3a0 kernel/cred.c:684
    __sys_setuid+0x285/0x400 kernel/sys.c:623
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x4665d9
    Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007fde54097188 EFLAGS: 00000246 ORIG_RAX: 0000000000000069
    RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
    RBP: 00000000004bfcb9 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056bf80
    R13: 00007ffc8655740f R14: 00007fde54097300 R15: 0000000000022000

    Allocated by task 16784:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_set_track mm/kasan/common.c:46 [inline]
    set_alloc_info mm/kasan/common.c:434 [inline]
    ____kasan_kmalloc mm/kasan/common.c:513 [inline]
    ____kasan_kmalloc mm/kasan/common.c:472 [inline]
    __kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:522
    kmalloc include/linux/slab.h:591 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    alloc_ucounts+0x23d/0x5b0 kernel/ucount.c:169
    set_cred_ucounts+0x171/0x3a0 kernel/cred.c:684
    __sys_setuid+0x285/0x400 kernel/sys.c:623
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Freed by task 16785:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
    kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
    ____kasan_slab_free mm/kasan/common.c:366 [inline]
    ____kasan_slab_free mm/kasan/common.c:328 [inline]
    __kasan_slab_free+0xfb/0x130 mm/kasan/common.c:374
    kasan_slab_free include/linux/kasan.h:229 [inline]
    slab_free_hook mm/slub.c:1650 [inline]
    slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1675
    slab_free mm/slub.c:3235 [inline]
    kfree+0xeb/0x650 mm/slub.c:4295
    put_ucounts kernel/ucount.c:200 [inline]
    put_ucounts+0x117/0x150 kernel/ucount.c:192
    put_cred_rcu+0x27a/0x520 kernel/cred.c:124
    rcu_do_batch kernel/rcu/tree.c:2550 [inline]
    rcu_core+0x7ab/0x1380 kernel/rcu/tree.c:2785
    __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Last potentially related work creation:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:348
    insert_work+0x48/0x370 kernel/workqueue.c:1332
    __queue_work+0x5c1/0xed0 kernel/workqueue.c:1498
    queue_work_on+0xee/0x110 kernel/workqueue.c:1525
    queue_work include/linux/workqueue.h:507 [inline]
    call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
    kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
    netdev_queue_add_kobject net/core/net-sysfs.c:1621 [inline]
    netdev_queue_update_kobjects+0x374/0x450 net/core/net-sysfs.c:1655
    register_queue_kobjects net/core/net-sysfs.c:1716 [inline]
    netdev_register_kobject+0x35a/0x430 net/core/net-sysfs.c:1959
    register_netdevice+0xd33/0x1500 net/core/dev.c:10331
    nsim_init_netdevsim drivers/net/netdevsim/netdev.c:317 [inline]
    nsim_create+0x381/0x4d0 drivers/net/netdevsim/netdev.c:364
    __nsim_dev_port_add+0x32e/0x830 drivers/net/netdevsim/dev.c:1295
    nsim_dev_port_add_all+0x53/0x150 drivers/net/netdevsim/dev.c:1355
    nsim_dev_probe+0xcb5/0x1190 drivers/net/netdevsim/dev.c:1496
    call_driver_probe drivers/base/dd.c:517 [inline]
    really_probe+0x23c/0xcd0 drivers/base/dd.c:595
    __driver_probe_device+0x338/0x4d0 drivers/base/dd.c:747
    driver_probe_device+0x4c/0x1a0 drivers/base/dd.c:777
    __device_attach_driver+0x20b/0x2f0 drivers/base/dd.c:894
    bus_for_each_drv+0x15f/0x1e0 drivers/base/bus.c:427
    __device_attach+0x228/0x4a0 drivers/base/dd.c:965
    bus_probe_device+0x1e4/0x290 drivers/base/bus.c:487
    device_add+0xc2f/0x2180 drivers/base/core.c:3356
    nsim_bus_dev_new drivers/net/netdevsim/bus.c:431 [inline]
    new_device_store+0x436/0x710 drivers/net/netdevsim/bus.c:298
    bus_attr_store+0x72/0xa0 drivers/base/bus.c:122
    sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:139
    kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
    call_write_iter include/linux/fs.h:2152 [inline]
    new_sync_write+0x426/0x650 fs/read_write.c:518
    vfs_write+0x75a/0xa40 fs/read_write.c:605
    ksys_write+0x12d/0x250 fs/read_write.c:658
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Second to last potentially related work creation:
    kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
    kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:348
    insert_work+0x48/0x370 kernel/workqueue.c:1332
    __queue_work+0x5c1/0xed0 kernel/workqueue.c:1498
    queue_work_on+0xee/0x110 kernel/workqueue.c:1525
    queue_work include/linux/workqueue.h:507 [inline]
    call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
    kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
    kobject_synth_uevent+0x701/0x850 lib/kobject_uevent.c:208
    uevent_store+0x20/0x50 drivers/base/core.c:2371
    dev_attr_store+0x50/0x80 drivers/base/core.c:2072
    sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:139
    kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
    call_write_iter include/linux/fs.h:2152 [inline]
    new_sync_write+0x426/0x650 fs/read_write.c:518
    vfs_write+0x75a/0xa40 fs/read_write.c:605
    ksys_write+0x12d/0x250 fs/read_write.c:658
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff88802821e400
    which belongs to the cache kmalloc-192 of size 192
    The buggy address is located 28 bytes inside of
    192-byte region [ffff88802821e400, ffff88802821e4c0)
    The buggy address belongs to the page:
    page:ffffea0000a08780 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2821e
    flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
    raw: 00fff00000000200 dead000000000100 dead000000000122 ffff888010841a00
    raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 12874702440, free_ts 12637793385
    prep_new_page mm/page_alloc.c:2433 [inline]
    get_page_from_freelist+0xa72/0x2f80 mm/page_alloc.c:4166
    __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5374
    alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2119
    alloc_pages+0x238/0x2a0 mm/mempolicy.c:2242
    alloc_slab_page mm/slub.c:1713 [inline]
    allocate_slab+0x32b/0x4c0 mm/slub.c:1853
    new_slab mm/slub.c:1916 [inline]
    new_slab_objects mm/slub.c:2662 [inline]
    ___slab_alloc+0x4ba/0x820 mm/slub.c:2825
    __slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2865
    slab_alloc_node mm/slub.c:2947 [inline]
    slab_alloc mm/slub.c:2989 [inline]
    __kmalloc+0x312/0x330 mm/slub.c:4133
    kmalloc include/linux/slab.h:596 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318
    rds_tcp_init_net+0x1db/0x4f0 net/rds/tcp.c:551
    ops_init+0xaf/0x470 net/core/net_namespace.c:140
    __register_pernet_operations net/core/net_namespace.c:1137 [inline]
    register_pernet_operations+0x35a/0x850 net/core/net_namespace.c:1214
    register_pernet_device+0x26/0x70 net/core/net_namespace.c:1301
    rds_tcp_init+0x77/0xe0 net/rds/tcp.c:717
    do_one_initcall+0x103/0x650 init/main.c:1285
    do_initcall_level init/main.c:1360 [inline]
    do_initcalls init/main.c:1376 [inline]
    do_basic_setup init/main.c:1396 [inline]
    kernel_init_freeable+0x6b8/0x741 init/main.c:1598
    page last free stack trace:
    reset_page_owner include/linux/page_owner.h:24 [inline]
    free_pages_prepare mm/page_alloc.c:1343 [inline]
    free_pcp_prepare+0x312/0x7d0 mm/page_alloc.c:1394
    free_unref_page_prepare mm/page_alloc.c:3329 [inline]
    free_unref_page+0x19/0x690 mm/page_alloc.c:3408
    __vunmap+0x783/0xb70 mm/vmalloc.c:2587
    free_work+0x58/0x70 mm/vmalloc.c:82
    process_one_work+0x98d/0x1630 kernel/workqueue.c:2276
    worker_thread+0x658/0x11f0 kernel/workqueue.c:2422
    kthread+0x3e5/0x4d0 kernel/kthread.c:319
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

    Memory state around the buggy address:
    ffff88802821e300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff88802821e380: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
    >ffff88802821e400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff88802821e480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    ffff88802821e500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    - The race fix has two parts.
      * Changing the code to guarantee that ucounts->count is only decremented
        when ucounts_lock is held. This guarantees that find_ucounts
        will never find a structure with a zero reference count.
      * Changing alloc_ucounts to increment ucounts->count while
        ucounts_lock is held. This guarantees the reference count on the
        found data structure will not be decremented to zero (and the data
        structure freed) before the reference count is incremented.
      -- Eric Biederman
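
    The two parts of the fix can be sketched in user space with a pthread
    mutex standing in for ucounts_lock. This is an illustration under
    stated assumptions, not the kernel implementation: struct entry,
    entry_get() and entry_put() are hypothetical names, the table is a
    plain linked list, allocation happens under the lock for brevity, and
    every put takes the lock, whereas the kernel may only need it for the
    final decrement:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct entry {
	unsigned int uid;
	int refs;           /* only changed while table_lock is held */
	struct entry *next;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *table_head;

/* Find (or create) the entry for uid and take a reference under the lock. */
static struct entry *entry_get(unsigned int uid)
{
	struct entry *e;

	pthread_mutex_lock(&table_lock);
	for (e = table_head; e; e = e->next)
		if (e->uid == uid)
			break;
	if (!e) {
		e = calloc(1, sizeof(*e));
		if (e) {
			e->uid = uid;
			e->next = table_head;
			table_head = e;
		}
	}
	if (e)
		e->refs++; /* taken before the lock is dropped */
	pthread_mutex_unlock(&table_lock);
	return e;
}

/* Drop a reference; the last drop unlinks under the lock, then frees. */
static void entry_put(struct entry *e)
{
	struct entry **pp;
	int last;

	assert(e != NULL);
	pthread_mutex_lock(&table_lock);
	last = (--e->refs == 0);
	if (last) {
		for (pp = &table_head; *pp; pp = &(*pp)->next) {
			if (*pp == e) {
				*pp = e->next;
				break;
			}
		}
	}
	pthread_mutex_unlock(&table_lock);
	if (last)
		free(e);
}
```

    Because the count only reaches zero while the lock is held and the
    entry is unlinked in the same critical section, a lookup can never
    return an entry that is about to be freed.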

    Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com
    Reported-by: syzbot+59dd63761094a80ad06d@syzkaller.appspotmail.com
    Reported-by: syzbot+6cd79f45bb8fa1c9eeae@syzkaller.appspotmail.com
    Reported-by: syzbot+b6e65bd125a05f803d6b@syzkaller.appspotmail.com
    Fixes: b6c336528926 ("Use atomic_t for ucounts reference counting")
    Cc: Hillf Danton
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/7b2ace1759b281cdd2d66101d6b305deef722efb.1627397820.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

29 Jun, 2021

1 commit

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     

01 May, 2021

8 commits

  • Dan Carpenter wrote:
    >
    > url: https://github.com/0day-ci/linux/commits/legion-kernel-org/Count-rlimits-in-each-user-namespace/20210427-162857
    > base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
    > config: arc-randconfig-m031-20210426 (attached as .config)
    > compiler: arceb-elf-gcc (GCC) 9.3.0
    >
    > If you fix the issue, kindly add following tag as appropriate
    > Reported-by: kernel test robot
    > Reported-by: Dan Carpenter
    >
    > smatch warnings:
    > kernel/ucount.c:270 dec_rlimit_ucounts() error: uninitialized symbol 'new'.
    >
    > vim +/new +270 kernel/ucount.c
    >
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 260 bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v)
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 261 {
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 262 struct ucounts *iter;
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 263 long new;
    > ^^^^^^^^
    >
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 264 for (iter = ucounts; iter; iter = iter->ns->ucounts) {
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 265 long dec = atomic_long_add_return(-v, &iter->ucount[type]);
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 266 WARN_ON_ONCE(dec < 0);
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 267 if (iter == ucounts)
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 268 new = dec;
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 269 }
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 @270 return (new == 0);
    > ^^^^^^^^
    > I don't know if this is a bug or not, but I can definitely tell why the
    > static checker complains about it.
    >
    > 176ec2b092cc22 Alexey Gladkov 2021-04-22 271 }

    In the only two cases that care about the return value of
    dec_rlimit_ucounts the code first tests to see that ucounts is not
    NULL. In those cases it is guaranteed at least one iteration of the
    loop will execute guaranteeing the variable new will be initialized.

    Initialize new to -1 so that the return value is well defined even
    when the loop does not execute and the static checker is silenced.
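
    A small user-space sketch of the resulting shape, assuming plain longs
    instead of atomics. The names struct level and uncharge() are
    hypothetical; the loop mirrors the quoted function only in structure:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one level of nested counters. */
struct level {
	long count;
	struct level *parent; /* NULL at the top */
};

/* Subtract v at every level; report whether the leaf count reached zero. */
static bool uncharge(struct level *leaf, long v)
{
	struct level *it;
	long new = -1; /* well-defined result even if the loop never runs */

	for (it = leaf; it; it = it->parent) {
		long dec = it->count -= v;

		if (it == leaf)
			new = dec;
	}
	return new == 0;
}
```

    With the initializer, calling the function with a NULL chain returns
    false rather than reading an uninitialized variable.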

    Link: https://lkml.kernel.org/r/m1tunny77w.fsf@fess.ebiederm.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Changelog

    v11:
    * Fix issue found by lkp robot.

    v8:
    * Fix issues found by lkp-tests project.

    v7:
    * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

    v6:
    * Fix bug in hugetlb_file_setup() detected by trinity.

    Reported-by: kernel test robot
    Reported-by: kernel test robot
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Changelog

    v11:
    * Revert most of changes to fix performance issues.

    v10:
    * Fix memory leak on get_ucounts failure.

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/df9d7764dddd50f28616b7840de74ec0f81711a8.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.

    To illustrate the impact of rlimits, let's say there is a program that
    does not fork. Some service-A wants to run this program as user X in
    multiple containers. Since the program never forks, the service wants to
    set RLIMIT_NPROC=1.

    service-A
    \- program (uid=1000, container1, rlimit_nproc=1)
    \- program (uid=1000, container2, rlimit_nproc=1)

    The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
    When the service-A tries to run a program with RLIMIT_NPROC=1 in
    container2 it fails since user X already has one running process.

    We cannot use existing inc_ucounts / dec_ucounts because they do not
    allow us to exceed the maximum for the counter. Some rlimits can be
    overlimited by root or if the user has the appropriate capability.

    Changelog

    v11:
    * Change inc_rlimit_ucounts() which now returns top value of ucounts.
    * Drop inc_rlimit_ucounts_and_test() because the return code of
    inc_rlimit_ucounts() can be checked.

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • The current implementation of the ucounts reference counter requires the
    use of spin_lock. We're going to use get_ucounts() in more performance
    critical areas like the handling of RLIMIT_SIGPENDING.

    Now we need to use spin_lock only if we want to change the hashtable.

    v10:
    * Always try to put ucounts in case we cannot increase ucounts->count.
    This covers the case when all consumers return ucounts at once.

    v9:
    * Use a negative value to check that the ucounts->count is close to
    overflow.
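
    The negative-value check from v9 can be sketched with C11 atomics.
    This is an assumption-laden illustration, not the kernel code:
    ref_get() is a hypothetical name, and on failure it simply undoes the
    increment instead of going through a full put:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to take a reference. If the increment makes the counter negative,
 * the count is at (or past) overflow, so undo it and report failure.
 */
static bool ref_get(atomic_int *count)
{
	/* C11 atomic arithmetic on signed types wraps silently. */
	int old = atomic_fetch_add(count, 1);
	int now = (old == INT_MAX) ? INT_MIN : old + 1;

	if (now < 0) {
		atomic_fetch_sub(count, 1); /* undo the failed increment */
		return false;
	}
	return true;
}
```

    Using the sign bit as the overflow indicator means the check needs no
    lock and no separate comparison against a maximum value.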

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/94d1dbecab060a6b116b0a2d1accd8ca1bbb4f5f.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • For RLIMIT_NPROC and some other rlimits the user_struct that holds the
    global limit is kept alive for the lifetime of a process by keeping it
    in struct cred. Adding a pointer to ucounts in struct cred makes it
    possible to track RLIMIT_NPROC not only per user in the system, but per
    user in the user_namespace.

    Updating ucounts may require a memory allocation, which may fail. So we
    cannot change cred.ucounts in commit_creds(), because this function
    cannot fail and should always return 0. For this reason, we modify
    cred.ucounts before calling commit_creds().

    Changelog

    v6:
    * Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
    error was caused by the fact that cred_alloc_blank() left the ucounts
    pointer empty.

    Reported-by: kernel test robot
    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     
  • RLIMIT_MSGQUEUE and RLIMIT_MEMLOCK use unsigned long to store their
    counters. As a preparation for moving rlimits based on ucounts, we need
    to increase the size of the variable to long.

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/257aa5fb1a7d81cf0f4c34f39ada2320c4284771.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

16 Mar, 2021

1 commit

  • fanotify has some hardcoded limits. The only APIs to escape those limits
    are FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS.

    Allow finer grained tuning of the system limits via sysctl tunables under
    /proc/sys/fs/fanotify, similar to tunables under /proc/sys/fs/inotify,
    with some minor differences.

    - max_queued_events - global system tunable for group queue size limit.
    Like the inotify tunable with the same name, it defaults to 16384 and
    applies on initialization of a new group.

    - max_user_marks - user ns tunable for marks limit per user.
    Like the inotify tunable named max_user_watches, on a machine with
    sufficient RAM it defaults to 1048576 in init userns and can be
    further limited per containing user ns.

    - max_user_groups - user ns tunable for number of groups per user.
    Like the inotify tunable named max_user_instances, it defaults to 128
    in init userns and can be further limited per containing user ns.

    The slightly different tunable names used for fanotify are derived from
    the "group" and "mark" terminology used in the fanotify man pages and
    throughout the code.

    Considering the fact that the default value for max_user_watches was
    increased in kernel v5.10 from 8192 to 1048576, leaving the legacy
    fanotify limit of 8192 marks per group in addition to the max_user_marks
    limit makes little sense, so the per group marks limit has been removed.

    Note that when a group is initialized with FAN_UNLIMITED_MARKS, its own
    marks are not accounted in the per user marks account, so in effect the
    limit of max_user_marks is only for the collection of groups that are
    not initialized with FAN_UNLIMITED_MARKS.

    Link: https://lore.kernel.org/r/20210304112921.3996419-2-amir73il@gmail.com
    Suggested-by: Jan Kara
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara

    Amir Goldstein
     

08 Apr, 2020

1 commit

  • Commit 769071ac9f20 "ns: Introduce Time Namespace" broke reporting of
    inotify ucounts (max_inotify_instances, max_inotify_watches) in
    /proc/sys/user because it has added UCOUNT_TIME_NAMESPACES into enum
    ucount_type but didn't properly update reporting in
    kernel/ucount.c:setup_userns_sysctls(). This problem got fixed in commit
    eeec26d5da82 "time/namespace: Add max_time_namespaces ucount".

    Add BUILD_BUG_ON to catch a similar problem in the future.
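
    The same kind of guard can be sketched in standard C11 with
    _Static_assert. Everything named here (enum counter_type,
    counter_names, ARRAY_LEN) is hypothetical and stands in for the
    kernel's enum and table; the exact expression used in the kernel is
    not reproduced:

```c
#include <stddef.h>

/* Hypothetical enum of counter types; the last member is the count. */
enum counter_type {
	COUNTER_USER_NS,
	COUNTER_TIME_NS,
	COUNTER_INOTIFY_INSTANCES,
	COUNTER_TYPES, /* keep last */
};

/* Table that must have exactly one entry per counter type. */
static const char *const counter_names[] = {
	"max_user_namespaces",
	"max_time_namespaces",
	"max_inotify_instances",
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Fails the build if an enum member is added without a table entry. */
_Static_assert(ARRAY_LEN(counter_names) == COUNTER_TYPES,
	       "counter_names[] is out of sync with enum counter_type");
```

    Adding a new enum member without a matching table entry then stops
    the build instead of silently shifting which name maps to which
    counter.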

    Signed-off-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Acked-by: Andrei Vagin
    Link: https://lkml.kernel.org/r/20200407154643.10102-1-jack@suse.cz

    Jan Kara
     

07 Apr, 2020

1 commit

  • Michael noticed that userns limit for number of time namespaces is missing.

    Furthermore, the time namespace introduced UCOUNT_TIME_NAMESPACES but didn't
    introduce an array member in user_table[]. That would make the array's
    initialisation an OOB write, but by luck the user_table array has an extra
    empty member (all accesses to the array are limited by UCOUNT_COUNTS), so
    it silently reuses the last free member.

    Fixes a user-visible regression: because of the missing UCOUNT_ENTRY(),
    max_inotify_instances has limited the maximum number of namespaces
    instead of the number of inotify instances.

    Fixes: 769071ac9f20 ("ns: Introduce Time Namespace")
    Reported-by: Michael Kerrisk (man-pages)
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Thomas Gleixner
    Acked-by: Andrei Vagin
    Acked-by: Vincenzo Frascino
    Cc: stable@kernel.org
    Link: https://lkml.kernel.org/r/20200406171342.128733-1-dima@arista.com

    Dmitry Safonov
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members of struct ctl_table as
    the minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file has some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%
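
    A minimal userspace sketch of the shared-array idea follows. The array
    values come from the message above; the macro names are assumed to
    follow the kernel's SYSCTL_ZERO, SYSCTL_ONE and SYSCTL_INT_MAX, which
    the message does not spell out. struct ctl_table_sketch,
    value_in_range() and example_entry are hypothetical simplifications,
    not the real struct ctl_table or proc_dointvec_minmax().

```c
#include <limits.h>
#include <stddef.h>

/* Shared constants, mirroring the values named in the commit message. */
const int sysctl_vals[] = { 0, 1, INT_MAX };

/* Macros that pick the right array member (names are an assumption). */
#define SYSCTL_ZERO	((void *)&sysctl_vals[0])
#define SYSCTL_ONE	((void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX	((void *)&sysctl_vals[2])

/* Simplified stand-in for struct ctl_table: only the range members. */
struct ctl_table_sketch {
	const char *procname;
	void *extra1;	/* points at the minimum allowed value, or NULL */
	void *extra2;	/* points at the maximum allowed value, or NULL */
};

/* Hypothetical range check in the spirit of proc_dointvec_minmax(). */
int value_in_range(const struct ctl_table_sketch *t, int val)
{
	const int *min = t->extra1;
	const int *max = t->extra2;

	if (min && val < *min)
		return 0;
	if (max && val > *max)
		return 0;
	return 1;
}

/* A boolean-style entry that needs no local zero/one variables. */
const struct ctl_table_sketch example_entry = {
	.procname = "example_flag",
	.extra1 = SYSCTL_ZERO,
	.extra2 = SYSCTL_ONE,
};
```

    Every table entry that needs 0, 1 or INT_MAX as a bound can point into
    the one shared array, which is where the size reduction shown above
    comes from.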

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Apr, 2018

1 commit

  • Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
    reason. It looks like it's only a convenience, so remove kmemleak.h
    from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
    don't already #include it. Also remove <linux/kmemleak.h> from source
    files that do not use it.

    This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
    be good to run it through the 0day bot for other $ARCHes. I have
    neither the horsepower nor the storage space for the other $ARCHes.

    Update: This patch has been extensively build-tested by both the 0day
    bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
    for which patches are included here (in v2).

    [ slab.h is the second most used header file after module.h; kernel.h is
    right there with slab.h. There could be some minor error in the
    counting due to some #includes having comments after them and I didn't
    combine all of those. ]

    [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
    Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
    Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
    Signed-off-by: Randy Dunlap
    Reviewed-by: Ingo Molnar
    Reported-by: Michael Ellerman [2 build failures]
    Reported-by: Fengguang Wu [2 build failures]
    Reviewed-by: Andrew Morton
    Cc: Wei Yongjun
    Cc: Luis R. Rodriguez
    Cc: Greg Kroah-Hartman
    Cc: Mimi Zohar
    Cc: John Johansen
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

07 Mar, 2017

1 commit

  • Always increment/decrement ucount->count under the ucounts_lock. The
    increments are there already and moving the decrements there means the
    locking logic of the code is simpler. This simplification in the
    locking logic fixes a race between put_ucounts and get_ucounts that
    could result in a use-after-free because the count could go to zero,
    then be found by get_ucounts, and then be freed by put_ucounts.

    A bug, presumably this one, was found by a combination of syzkaller and
    KASAN. JongHwan Kim reported the syzkaller failure and Dmitry Vyukov
    spotted the race in the code.

    Cc: stable@vger.kernel.org
    Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user")
    Reported-by: JongHwan Kim
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"
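
    The locking rule described above can be sketched in userspace with a
    pthread mutex standing in for ucounts_lock. Everything here is a
    hypothetical model (struct ucounts_sketch, table_lock, the linked list
    and both functions), not the kernel implementation: the point is only
    that the decrement, the zero test and the unlink all happen under the
    same lock that lookups take.

```c
#include <pthread.h>
#include <stdlib.h>

struct ucounts_sketch {
	struct ucounts_sketch *next;
	unsigned int uid;
	int count;		/* only touched with table_lock held */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ucounts_sketch *table_head;

/* Look up or create the entry for uid and take a reference. */
struct ucounts_sketch *get_ucounts_sketch(unsigned int uid)
{
	struct ucounts_sketch *uc;

	pthread_mutex_lock(&table_lock);
	for (uc = table_head; uc; uc = uc->next)
		if (uc->uid == uid)
			break;
	if (!uc) {
		/* Allocating under the lock keeps this userspace sketch short. */
		uc = calloc(1, sizeof(*uc));
		if (uc) {
			uc->uid = uid;
			uc->next = table_head;
			table_head = uc;
		}
	}
	if (uc)
		uc->count++;
	pthread_mutex_unlock(&table_lock);
	return uc;
}

/* Drop a reference; returns 1 if the entry was freed, 0 otherwise. */
int put_ucounts_sketch(struct ucounts_sketch *uc)
{
	struct ucounts_sketch **pp;
	int freed = 0;

	pthread_mutex_lock(&table_lock);
	if (--uc->count == 0) {
		/* Unlink while still holding the lock so no lookup can find it. */
		for (pp = &table_head; *pp; pp = &(*pp)->next) {
			if (*pp == uc) {
				*pp = uc->next;
				break;
			}
		}
		freed = 1;
	}
	pthread_mutex_unlock(&table_lock);
	if (freed)
		free(uc);	/* safe: the entry is no longer reachable */
	return freed;
}
```

    Because a lookup can never observe an entry whose count has already
    dropped to zero, the window for the use-after-free described above is
    closed in this model.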

    Eric W. Biederman
     

02 Mar, 2017

1 commit


24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unprivileged mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns out we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl,
    systemd, forks all children after the prctl. So no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpointing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to manipulate
    namespaces with two new trivial ioctls to allow discovery of the
    hierarchy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that, when a
    network namespace exits, purges its sysctl entries from the dcache, as
    in some circumstances these could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec, allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in its own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm, allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount, as the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace make it easier to
    discuss how to fix the worst case complexity of umount. The stacked
    filesystem fixes pave the way for adding multiple mappings for the
    filesystem uids so that efficient and safer containers can be implemented"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

09 Feb, 2017

1 commit

  • The user_header gets caught by kmemleak with the following splat as
    missing a free:

    unreferenced object 0xffff99667a733d80 (size 96):
    comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s)
    hex dump (first 32 bytes):
    a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00 ................
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmemleak_alloc+0x4a/0xa0
    __kmalloc+0x144/0x260
    __register_sysctl_table+0x54/0x5e0
    register_sysctl+0x1b/0x20
    user_namespace_sysctl_init+0x17/0x34
    do_one_initcall+0x52/0x1a0
    kernel_init_freeable+0x173/0x200
    kernel_init+0xe/0x100
    ret_from_fork+0x2c/0x40

    The BUG_ON()s are intended to crash, so there is no need to clean up
    after ourselves on error there. This is also a kernel/ subsys_init(), so
    we don't need a corresponding exit call here as this is never modular;
    just whitelist it.

    Link: http://lkml.kernel.org/r/20170203211404.31458-1-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Nikolay Borisov
    Cc: Serge Hallyn
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

24 Jan, 2017

2 commits

  • This patchset converts inotify to using the newly introduced
    per-userns sysctl infrastructure.

    Currently the inotify instances/watches are being accounted in the
    user_struct structure. This means that in setups where multiple
    users in unprivileged containers map to the same underlying
    real user (i.e. pointing to the same user_struct) the inotify limits
    are going to be shared as well, allowing one user (or application) to
    exhaust all others' limits.

    Fix this by switching the inotify sysctls to using the
    per-namespace/per-user limits. This will allow the server admin to
    set sensible global limits, which can further be tuned inside every
    individual user namespace. Additionally, in order to preserve the
    sysctl ABI make the existing inotify instances/watches sysctls
    modify the values of the initial user namespace.

    Signed-off-by: Nikolay Borisov
    Acked-by: Jan Kara
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman
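
    One way to picture limits that are set globally and tuned per namespace
    is a charge that must fit at every level of the namespace hierarchy. The
    sketch below is a simplified, hypothetical model: the real accounting is
    per user within each namespace and uses atomic counters, whereas this
    version keeps one plain counter per namespace. struct userns_sketch and
    both functions are made-up names.

```c
#include <stddef.h>

struct userns_sketch {
	struct userns_sketch *parent;	/* NULL for the initial namespace */
	int max_inotify_instances;	/* limit configured for this level */
	int inotify_instances;		/* current charge at this level */
};

/*
 * Charge one instance to ns and to every ancestor.
 * Returns 0 on success, or -1 if any level is already at its limit,
 * in which case all charges made so far are rolled back.
 */
int inc_inotify_instances(struct userns_sketch *ns)
{
	struct userns_sketch *iter, *undo;

	for (iter = ns; iter; iter = iter->parent) {
		if (iter->inotify_instances >= iter->max_inotify_instances)
			goto fail;
		iter->inotify_instances++;
	}
	return 0;
fail:
	for (undo = ns; undo != iter; undo = undo->parent)
		undo->inotify_instances--;
	return -1;
}

/* Release one instance from ns and from every ancestor. */
void dec_inotify_instances(struct userns_sketch *ns)
{
	for (; ns; ns = ns->parent)
		ns->inotify_instances--;
}
```

    In this model a container that hits its own limit fails without
    consuming the parent's budget, so one namespace cannot exhaust the
    limits of its siblings beyond what the parent level allows.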

    Nikolay Borisov
     
  • The ucounts_lock is being used to protect various ucounts lifecycle
    management functionalities. However, those services can also be invoked
    when a pidns is being freed in an RCU callback (e.g. softirq context).
    This can lead to deadlocks. There were already efforts trying to
    prevent similar deadlocks in add7c65ca426 ("pid: fix lockdep deadlock
    warning due to ucount_lock"), however they just moved the context
    from hardirq to softirq. Fix this issue once and for all by explicitly
    making the lock disable irqs altogether.

    Dmitry Vyukov reported:

    > I've got the following deadlock report while running syzkaller fuzzer
    > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
    > device if it matters):
    >
    > =================================
    > [ INFO: inconsistent lock state ]
    > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
    > ---------------------------------
    > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    > (ucounts_lock){+.?...}, at: [< inline >] spin_lock
    > ./include/linux/spinlock.h:302
    > (ucounts_lock){+.?...}, at: []
    > put_ucounts+0x60/0x138 kernel/ucount.c:162
    > {SOFTIRQ-ON-W} state was registered at:
    > [] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
    > [< inline >] mark_irqflags kernel/locking/lockdep.c:2941
    > [] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
    > [] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
    > [< inline >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
    > [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
    > [< inline >] spin_lock ./include/linux/spinlock.h:302
    > [< inline >] get_ucounts kernel/ucount.c:131
    > [] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
    > [< inline >] inc_mnt_namespaces fs/namespace.c:2818
    > [] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
    > [] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
    > [< inline >] init_mount_tree fs/namespace.c:3199
    > [] mnt_init+0x258/0x384 fs/namespace.c:3251
    > [] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
    > [] start_kernel+0x414/0x460 init/main.c:648
    > [] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
    > irq event stamp: 2316924
    > hardirqs last enabled at (2316924): [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2911
    > hardirqs last enabled at (2316924): [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > hardirqs last enabled at (2316924): [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > hardirqs last enabled at (2316924): []
    > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
    > hardirqs last disabled at (2316923): [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2900
    > hardirqs last disabled at (2316923): [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > hardirqs last disabled at (2316923): [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > hardirqs last disabled at (2316923): []
    > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
    > softirqs last enabled at (2316912): []
    > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
    > softirqs last disabled at (2316913): [< inline >]
    > do_softirq_own_stack ./include/linux/interrupt.h:488
    > softirqs last disabled at (2316913): [< inline >]
    > invoke_softirq kernel/softirq.c:371
    > softirqs last disabled at (2316913): []
    > irq_exit+0x264/0x308 kernel/softirq.c:405
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(ucounts_lock);
    >
    > lock(ucounts_lock);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by swapper/2/0:
    > #0: (rcu_callback){......}, at: [< inline >] __rcu_reclaim
    > kernel/rcu/rcu.h:108
    > #0: (rcu_callback){......}, at: [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2919
    > #0: (rcu_callback){......}, at: [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > #0: (rcu_callback){......}, at: [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > #0: (rcu_callback){......}, at: []
    > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
    >
    > stack backtrace:
    > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
    > Hardware name: Hardkernel ODROID-C2 (DT)
    > Call trace:
    > [] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
    > [] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
    > [] dump_stack+0x110/0x168
    > [] print_usage_bug.part.27+0x49c/0x4bc
    > kernel/locking/lockdep.c:2387
    > [< inline >] print_usage_bug kernel/locking/lockdep.c:2357
    > [< inline >] valid_state kernel/locking/lockdep.c:2400
    > [< inline >] mark_lock_irq kernel/locking/lockdep.c:2617
    > [] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
    > [< inline >] mark_irqflags kernel/locking/lockdep.c:2923
    > [] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
    > [] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
    > [< inline >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
    > [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
    > [< inline >] spin_lock ./include/linux/spinlock.h:302
    > [] put_ucounts+0x60/0x138 kernel/ucount.c:162
    > [] dec_ucount+0xf4/0x158 kernel/ucount.c:214
    > [< inline >] dec_pid_namespaces kernel/pid_namespace.c:89
    > [] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
    > [< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
    > [< inline >] rcu_do_batch kernel/rcu/tree.c:2919
    > [< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > [< inline >] __rcu_process_callbacks kernel/rcu/tree.c:3149
    > [] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
    > [] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
    > [< inline >] do_softirq_own_stack ./include/linux/interrupt.h:488
    > [< inline >] invoke_softirq kernel/softirq.c:371
    > [] irq_exit+0x264/0x308 kernel/softirq.c:405
    > [] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
    > [] gic_handle_irq+0x68/0xd8
    > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
    > 7dc0: ffff8000648d4b3c 0000000000000007
    > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
    > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
    > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
    > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
    > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
    > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
    > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
    > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
    > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
    > [] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
    > [< inline >] arch_local_irq_restore
    > ./arch/arm64/include/asm/irqflags.h:81
    > [] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
    > [< inline >] cpuidle_idle_call kernel/sched/idle.c:200
    > [] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
    > [] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
    > [] secondary_start_kernel+0x2cc/0x358
    > arch/arm64/kernel/smp.c:276
    > [] 0x279f1a4

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Fixes: add7c65ca426 ("pid: fix lockdep deadlock warning due to ucount_lock")
    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Cc: stable@vger.kernel.org
    Link: https://www.spinics.net/lists/kernel/msg2426637.html
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Eric W. Biederman

    Nikolay Borisov
     

31 Aug, 2016

1 commit


09 Aug, 2016

9 commits