06 Apr, 2019

40 commits

  • [ Upstream commit dce30ca9e3b676fb288c33c1f4725a0621361185 ]

    guard_bio_eod() can truncate a segment in bio to allow it to do IO on
    odd last sectors of a device.

    It already checks if the IO starts past EOD, but it does not consider
    the possibility that an IO request starting within device boundaries
    can contain more than one segment past EOD.

    In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
    underflow bvec->bv_len.

    Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
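
    The underflow described above can be illustrated with a small userspace
    C sketch. This is not the kernel code: struct sketch_bvec and both
    sketch_truncate_*() helpers are hypothetical names, and the guard
    compares against the segment length itself rather than PAGE_SIZE, which
    is an assumption made for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one field of a bio segment that matters here. */
struct sketch_bvec {
	uint32_t bv_len; /* segment length in bytes (unsigned, so it can wrap) */
};

/* Unchecked truncation: wraps around when truncated_bytes > bv_len. */
void sketch_truncate_unchecked(struct sketch_bvec *bvec, uint32_t truncated_bytes)
{
	bvec->bv_len -= truncated_bytes;
}

/*
 * Checked truncation (assumed guard): refuse to touch the segment when more
 * bytes would be cut than it holds. Returns true if the segment was truncated.
 */
bool sketch_truncate_checked(struct sketch_bvec *bvec, uint32_t truncated_bytes)
{
	if (truncated_bytes > bvec->bv_len)
		return false;
	bvec->bv_len -= truncated_bytes;
	return true;
}
```

    With a 4096-byte segment and 8192 truncated bytes, the unchecked variant
    wraps to 4294963200, while the checked variant leaves the segment alone.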

    This situation has been found on filesystems such as isofs and vfat,
    which don't check the device size before mount. If the device is
    smaller than the filesystem itself, a readahead on such a filesystem
    that spans EOD can trigger this situation, leading to a call to
    zero_user() with a wrong size, possibly corrupting memory.

    I didn't see any crash (or didn't let the system run long enough to
    check whether memory corruption would be hit somewhere), but adding
    instrumentation to guard_bio_eod() to check the truncated_bytes size
    was enough to see the error.

    The following script can trigger the error.

    MNT=/mnt
    IMG=./DISK.img
    DEV=/dev/loop0

    mkfs.vfat $IMG
    mount $IMG $MNT
    cp -R /etc $MNT &> /dev/null
    umount $MNT

    losetup -D

    losetup --find --show --sizelimit 16247280 $IMG
    mount $DEV $MNT

    find $MNT -type f -exec cat {} + >/dev/null

    Kudos to Eric Sandeen for coming up with the reproducer above.

    Reviewed-by: Ming Lei
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Carlos Maiolino
     
  • [ Upstream commit 6e876c3dd205d30b0db6850e97a03d75457df007 ]

    In jbd2_journal_commit_transaction(), if we are in abort mode,
    we may flush the buffer without setting the descriptor block checksum
    by jumping to start_journal_io. Then, when the fs is mounted,
    jbd2_descriptor_block_csum_verify() fails.

    [ 271.379811] EXT4-fs (vdd): shut down requested (2)
    [ 271.381827] Aborting journal on device vdd-8.
    [ 271.597136] JBD2: Invalid checksum recovering block 22199 in log
    [ 271.598023] JBD2: recovery failed
    [ 271.598484] EXT4-fs (vdd): error loading journal

    Fix this problem by always setting the descriptor block checksum if the
    descriptor buffer is not NULL.

    This checksum problem can be reproduced by xfstests generic/388.

    Signed-off-by: luojiajun
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Signed-off-by: Sasha Levin

    luojiajun
     
  • [ Upstream commit be0502a3f2e94211a8809a09ecbc3a017189b8fb ]

    TCP resets cause instant transition from established to closed state
    provided the reset is in-window. Endpoints that implement RFC 5961
    require resets to match the next expected sequence number.
    RST segments that are in-window (but that do not match RCV.NXT) are
    ignored, and a "challenge ACK" is sent back.

    The main problem for conntrack is that it's a middlebox, i.e. whereas an
    end host might have ACK'd SEQ (and would thus accept an RST with this
    sequence number), conntrack might not have seen this ACK (yet).

    Therefore we can't simply flag RSTs with non-exact match as invalid.

    This updates RST processing as follows:

    1. If the connection is in a state other than ESTABLISHED, nothing is
    changed, RST is subject to normal in-window check.

    2. If the RST's sequence number exactly matches RCV.NXT, the
    connection state moves to CLOSE.

    3. The same applies if the RST sequence number aligns with a previous
    packet in the same direction.

    In all other cases, the connection remains in ESTABLISHED state.
    If the normal in-window check passes, the timeout will be lowered
    to that of CLOSE.

    If the peer sends a challenge ack, connection timeout will be reset.

    If the challenge ACK triggers another RST (RST was valid after all),
    this 2nd RST will match expected sequence and conntrack state changes to
    CLOSE.

    If no challenge ACK is received, the connection will time out after
    CLOSE seconds (10 seconds by default), just like without this patch.
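
    The RST handling rules above can be sketched as a small decision
    function. This is a simplified userspace model, not the conntrack
    implementation: the enum, the function name and its parameters are
    hypothetical, and it assumes the normal in-window check is evaluated
    before the exact-match rules.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts; these are not conntrack's own identifiers. */
enum sketch_rst_verdict {
	SKETCH_RST_INVALID,          /* failed the normal in-window check */
	SKETCH_RST_CLOSE,            /* connection state moves to CLOSE */
	SKETCH_RST_STAY_ESTABLISHED, /* stays ESTABLISHED, timeout lowered to that of CLOSE */
};

/*
 * established: connection is currently in ESTABLISHED state
 * in_window:   result of the normal in-window check for this RST
 * seq:         sequence number carried by the RST
 * rcv_nxt:     next expected sequence number (RCV.NXT)
 * last_seq:    sequence number of a previous packet in the same direction
 */
enum sketch_rst_verdict sketch_classify_rst(bool established, bool in_window,
					    uint32_t seq, uint32_t rcv_nxt,
					    uint32_t last_seq)
{
	if (!in_window)
		return SKETCH_RST_INVALID;
	/* Rule 1: outside ESTABLISHED only the in-window check applies. */
	if (!established)
		return SKETCH_RST_CLOSE;
	/* Rules 2 and 3: exact match, or match with an earlier packet. */
	if (seq == rcv_nxt || seq == last_seq)
		return SKETCH_RST_CLOSE;
	/* In-window but inexact: wait for a challenge ACK or a second RST. */
	return SKETCH_RST_STAY_ESTABLISHED;
}
```

    With the sequence numbers used in the packetdrill test case, an RST
    with sequence 1010 against an expected 1001 keeps the connection
    ESTABLISHED, while an RST with sequence 1001 moves it to CLOSE.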

    Packetdrill test case:

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < S 0:0(0) win 32792
    0.100 > S. 0:0(0) ack 1 win 64240
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4

    // Receive a segment.
    0.210 < P. 1:1001(1000) ack 1 win 46
    0.210 > . 1:1(0) ack 1001

    // Application writes 1000 bytes.
    0.250 write(4, ..., 1000) = 1000
    0.250 > P. 1:1001(1000) ack 1001

    // First reset, old sequence. Conntrack (correctly) considers this
    // invalid due to failed window validation (regardless of this patch).
    0.260 < R 2:2(0) ack 1001 win 260

    // 2nd reset, but too far ahead sequence. Same: correctly handled
    // as invalid.
    0.270 < R 99990001:99990001(0) ack 1001 win 260

    // in-window, but not exact sequence.
    // Current Linux kernels might reply with a challenge ack, and do not
    // remove connection.
    // Without this patch, conntrack state moves to CLOSE.
    // With patch, timeout is lowered like CLOSE, but connection stays
    // in ESTABLISHED state.
    0.280 < R 1010:1010(0) ack 1001 win 260

    // Expect challenge ACK
    0.281 > . 1001:1001(0) ack 1001 win 501

    // With or without this patch, RST will cause connection
    // to move to CLOSE (sequence number matches)
    // 0.282 < R 1001:1001(0) ack 1001 win 260

    // ACK
    0.300 < . 1001:1001(0) ack 1001 win 257

    // more data could be exchanged here, connection
    // is still established

    // Client closes the connection.
    0.610 < F. 1001:1001(0) ack 1001 win 260
    0.650 > . 1001:1001(0) ack 1002

    // Close the connection without reading outstanding data
    0.700 close(4) = 0

    // so one more reset. Will be deemed acceptable with patch as well:
    // connection is already closing.
    0.701 > R. 1001:1001(0) ack 1002 win 501
    // End packetdrill test case.

    With patch, this generates following conntrack events:
    [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
    [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
    [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
    [UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
    [UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
    [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]

    Without patch, first RST moves connection to close, whereas socket state
    does not change until FIN is received.
    [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
    [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
    [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
    [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]

    Cc: Jozsef Kadlecsik
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Florian Westphal
     
  • [ Upstream commit a9f5e78c403d2d62ade4f4c85040efc85f4049b8 ]

    Check the result of dereferencing base_chain->stats, instead of
    comparing the result of this_cpu_ptr with NULL.

    base_chain->stats may be changed to NULL when a chain is updated and a
    new NULL counter can be attached.

    We also do not need to check the return value of this_cpu_ptr: since
    base_chain->stats comes from the percpu allocator, if it is non-NULL,
    this_cpu_ptr returns a valid value.

    Also fix two sparse errors by replacing rcu_access_pointer and
    rcu_dereference with READ_ONCE under rcu_read_lock.

    Thanks to Eric for his help in finishing this patch.

    Fixes: 009240940e84c1 ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Zhang Yu
    Signed-off-by: Li RongQing
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Li RongQing
     
  • [ Upstream commit 68e2672f8fbd1e04982b8d2798dd318bf2515dd2 ]

    There is a NULL pointer dereference of devname in strspn().

    The oops looks something like:

    CIFS: Attempting to mount (null)
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ...
    RIP: 0010:strspn+0x0/0x50
    ...
    Call Trace:
    ? cifs_parse_mount_options+0x222/0x1710 [cifs]
    ? cifs_get_volume_info+0x2f/0x80 [cifs]
    cifs_setup_volume_info+0x20/0x190 [cifs]
    cifs_get_volume_info+0x50/0x80 [cifs]
    cifs_smb3_do_mount+0x59/0x630 [cifs]
    ? ida_alloc_range+0x34b/0x3d0
    cifs_do_mount+0x11/0x20 [cifs]
    mount_fs+0x52/0x170
    vfs_kern_mount+0x6b/0x170
    do_mount+0x216/0xdc0
    ksys_mount+0x83/0xd0
    __x64_sys_mount+0x25/0x30
    do_syscall_64+0x65/0x220
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Fix this by adding a NULL check on devname in cifs_parse_devname().
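
    A minimal userspace sketch of such a guard follows. The helper name
    sketch_leading_delims() and the choice of -EINVAL are assumptions for
    illustration; this is not the code in cifs_parse_devname().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the leading '/' or '\' delimiters of a device
 * name such as "//server/share". Returns 0 on success and stores the count
 * in *ndelims, or -EINVAL when devname is NULL.
 */
int sketch_leading_delims(const char *devname, size_t *ndelims)
{
	if (devname == NULL)
		return -EINVAL; /* strspn(NULL, ...) would dereference NULL */

	*ndelims = strspn(devname, "/\\");
	return 0;
}
```

    The check has to come before strspn(), since strspn() reads its first
    argument unconditionally.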

    Signed-off-by: Yao Liu
    Signed-off-by: Steve French
    Signed-off-by: Sasha Levin

    Yao Liu
     
  • [ Upstream commit 969ae8e8d4ee54c99134d3895f2adf96047f5bee ]

    Older Windows versions and NetApp SMB servers return
    NT_STATUS_NOT_SUPPORTED since they do not allow or implement
    FSCTL_VALIDATE_NEGOTIATE_INFO. The client should accept the response
    provided it's properly signed.

    See
    https://blogs.msdn.microsoft.com/openspecification/2012/06/28/smb3-secure-dialect-negotiation/

    and

    MS-SMB2 validate negotiate response processing:
    https://msdn.microsoft.com/en-us/library/hh880630.aspx

    The Samba client already handles this:
    https://bugzilla.samba.org/attachment.cgi?id=13285&action=edit

    Signed-off-by: Namjae Jeon
    Signed-off-by: Steve French
    Signed-off-by: Sasha Levin

    Namjae Jeon
     
  • [ Upstream commit 500e0b28ecd3c5aade98f3c3a339d18dcb166bb6 ]

    We use the condition below to check the inline_xattr_size boundary:

    if (!F2FS_OPTION(sbi).inline_xattr_size ||
            F2FS_OPTION(sbi).inline_xattr_size >=
                    DEF_ADDRS_PER_INODE -
                    F2FS_TOTAL_EXTRA_ATTR_SIZE -
                    DEF_INLINE_RESERVED_SIZE -
                    DEF_MIN_INLINE_SIZE)

    There are three problems in that check:
    - we should allow inline_xattr_size to be equal to the min size of the
      inline {data,dentry} area.
    - F2FS_TOTAL_EXTRA_ATTR_SIZE and inline_xattr_size are based on
      different size units: the former is in units of 4 bytes, the latter
      in units of 1 byte.
    - DEF_MIN_INLINE_SIZE only indicates the min size of the inline data
      area; however, we need to consider the min size of the inline dentry
      area as well. A minimal inline dentry should contain at least two
      entries, '.' and '..', so the min inline_dentry size is 40 bytes.

    .bitmap      1 * 1 =  1
    .reserved    1 * 1 =  1
    .dentry     11 * 2 = 22
    .filename    8 * 2 = 16
    total                40
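
    The breakdown above can be checked with a few lines of C. The constant
    and function names are hypothetical; only the byte counts come from the
    table.

```c
/* Sizes in bytes, taken from the breakdown above; the names are hypothetical. */
enum {
	SKETCH_BITMAP_BYTES   = 1,  /* .bitmap   */
	SKETCH_RESERVED_BYTES = 1,  /* .reserved */
	SKETCH_DENTRY_BYTES   = 11, /* one .dentry entry */
	SKETCH_FILENAME_BYTES = 8,  /* one .filename slot */
	SKETCH_MIN_ENTRIES    = 2,  /* '.' and '..' */
};

/* Minimum inline dentry area: bitmap + reserved + two dentries + two names. */
unsigned int sketch_min_inline_dentry_size(void)
{
	return SKETCH_BITMAP_BYTES + SKETCH_RESERVED_BYTES +
	       SKETCH_MIN_ENTRIES * (SKETCH_DENTRY_BYTES + SKETCH_FILENAME_BYTES);
}
```

    sketch_min_inline_dentry_size() evaluates to 1 + 1 + 2 * (11 + 8) = 40.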

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin

    Chao Yu
     
  • [ Upstream commit 70de2cbda8a5d788284469e755f8b097d339c240 ]

    Invoking dm_get_device() twice on the same device path with different
    modes is dangerous, because in that case upgrade_mode() will allocate a
    new 'dm_dev' and free the old one, which may still be referenced by a
    previous caller. Dereferencing the dangling pointer will trigger a
    kernel NULL pointer dereference.

    The following two cases can reproduce this issue. Actually, they are
    invalid setups that must be disallowed, e.g.:

    1. Creating a thin-pool with read_only mode, and the same device as
    both metadata and data.

    dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
    ...
    Call Trace:
    new_read+0xfb/0x110 [dm_bufio]
    dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
    ? kmem_cache_alloc_trace+0x15c/0x1e0
    __create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
    dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
    pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
    ? realloc_argv+0x50/0x70 [dm_mod]
    dm_table_add_target+0x14e/0x330 [dm_mod]
    table_load+0x122/0x2e0 [dm_mod]
    ? dev_status+0x40/0x40 [dm_mod]
    ctl_ioctl+0x1aa/0x3e0 [dm_mod]
    dm_ctl_ioctl+0xa/0x10 [dm_mod]
    do_vfs_ioctl+0xa2/0x600
    ? handle_mm_fault+0xda/0x200
    ? __do_page_fault+0x26c/0x4f0
    ksys_ioctl+0x60/0x90
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x55/0x150
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    2. Creating an external snapshot using the same thin-pool device.

    dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
    dmsetup message /dev/mapper/thinp 0 "create_thin 0"
    dmsetup create snap --table \
    "0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ...
    Call Trace:
    ? __alloc_pages_nodemask+0x13c/0x2e0
    retrieve_status+0xa5/0x1f0 [dm_mod]
    ? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
    table_status+0x61/0xa0 [dm_mod]
    ctl_ioctl+0x1aa/0x3e0 [dm_mod]
    dm_ctl_ioctl+0xa/0x10 [dm_mod]
    do_vfs_ioctl+0xa2/0x600
    ksys_ioctl+0x60/0x90
    ? ksys_write+0x4f/0xb0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x55/0x150
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: Jason Cai (Xiang Feng)
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Jason Cai (Xiang Feng)
     
  • [ Upstream commit 259594bea574e515a148171b5cd84ce5cbdc028a ]

    When compiling with -Wformat, clang emits the following warnings:

    fs/cifs/smb1ops.c:312:20: warning: format specifies type 'unsigned
    short' but the argument has type 'unsigned int' [-Wformat]
    tgt_total_cnt, total_in_tgt);
    ^~~~~~~~~~~~

    fs/cifs/cifs_dfs_ref.c:289:4: warning: format specifies type 'short'
    but the argument has type 'int' [-Wformat]
    ref->flags, ref->server_type);
    ^~~~~~~~~~

    fs/cifs/cifs_dfs_ref.c:289:16: warning: format specifies type 'short'
    but the argument has type 'int' [-Wformat]
    ref->flags, ref->server_type);
    ^~~~~~~~~~~~~~~~

    fs/cifs/cifs_dfs_ref.c:291:4: warning: format specifies type 'short'
    but the argument has type 'int' [-Wformat]
    ref->ref_flag, ref->path_consumed);
    ^~~~~~~~~~~~~

    fs/cifs/cifs_dfs_ref.c:291:19: warning: format specifies type 'short'
    but the argument has type 'int' [-Wformat]
    ref->ref_flag, ref->path_consumed);
    ^~~~~~~~~~~~~~~~~~

    The types of these arguments are unconditionally defined, so this patch
    updates the format characters to the correct ones for ints and unsigned
    ints.
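
    As an illustration of the kind of change involved, here is a hedged
    userspace sketch. The function name and the format string are
    hypothetical; only the argument names and their unsigned int type come
    from the first warning above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Both arguments are unsigned int, so "%u" is the matching conversion;
 * using "%hu" here is what clang's -Wformat complains about.
 */
int sketch_format_counts(char *buf, size_t len, unsigned int tgt_total_cnt,
			 unsigned int total_in_tgt)
{
	return snprintf(buf, len, "total=%u in_tgt=%u", tgt_total_cnt, total_in_tgt);
}
```

    Values above 65535, such as 70000, are printed in full with "%u".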

    Link: https://github.com/ClangBuiltLinux/linux/issues/378

    Signed-off-by: Louis Taylor
    Signed-off-by: Steve French
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Sasha Levin

    Louis Taylor
     
  • [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]

    KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
    It triggers false positives in the allocation path:

    BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
    Read of size 8 at addr ffff88881f800000 by task swapper/0
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    __asan_report_load8_noabort+0x19/0x20
    memchr_inv+0x2ea/0x330
    kernel_poison_pages+0x103/0x3d5
    get_page_from_freelist+0x15e7/0x4d90

    because KASAN has not yet unpoisoned the shadow page for the allocation
    before page poisoning checks it with memchr_inv(), so only a stale
    poison pattern is found.

    There are also false positives in the free path:

    BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
    Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
    CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    check_memory_region+0x22d/0x250
    memset+0x28/0x40
    kernel_poison_pages+0x29e/0x3d5
    __free_pages_ok+0x75f/0x13e0

    because KASAN adds poisoned redzones around slab objects, while page
    poisoning needs to poison the whole page.

    Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 5704a06810682683355624923547b41540e2801a ]

    (Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)

    Calling 'get_unused_fd_flags' in a kthread causes a kernel crash. It
    works fine on 4.1, but crashes after getting 64 fds. It also crashes on
    ubuntu1404/1604/1804 and centos7.5, and the crash messages are almost
    the same.

    The crash message on centos7.5 is shown below:

    start fd 61
    start fd 62
    start fd 63
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: __wake_up_common+0x2e/0x90
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
    virtio floppy dm_mirror dm_region_hash dm_log dm_mod
    CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G OE ------------ 3.10.0-862.3.3.el7.x86_64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
    task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
    RIP: 0010:__wake_up_common+0x2e/0x90
    RSP: 0018:ffff8e94247a2d18 EFLAGS: 00010086
    RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
    RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
    Call Trace:
    __wake_up+0x39/0x50
    expand_files+0x131/0x250
    __alloc_fd+0x47/0x170
    get_unused_fd_flags+0x30/0x40
    test_fd+0x12a/0x1c0 [test]
    kthread+0xd1/0xe0
    ret_from_fork_nospec_begin+0x21/0x21
    Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
    RIP __wake_up_common+0x2e/0x90
    RSP
    CR2: 0000000000000000

    This issue has existed since CentOS 7.5 (3.10.0-862); CentOS 7.4
    (3.10.0-693.21.1) is OK. Root cause: the item 'resize_wait' is not
    initialized before being used.

    Reported-by: Richard Zhang
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shuriyc Chu
     
  • [ Upstream commit 9083977dabf3833298ddcd40dee28687f1e6b483 ]

    Fix the warning below, which is caused by using a mutex lock in atomic
    context.

    BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
    in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
    Preemption disabled at: __radix_tree_preload+0x28/0x130
    Call trace:
    dump_backtrace+0x0/0x2b4
    show_stack+0x20/0x28
    dump_stack+0xa8/0xe0
    ___might_sleep+0x144/0x194
    __might_sleep+0x58/0x8c
    mutex_lock+0x2c/0x48
    f2fs_trace_pid+0x88/0x14c
    f2fs_set_node_page_dirty+0xd0/0x184

    Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
    spin_lock() acquired.

    Signed-off-by: Sahitya Tummala
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin

    Sahitya Tummala
     
  • [ Upstream commit cc725ef3cb202ef2019a3c67c8913efa05c3cce6 ]

    In the process of creating a node, a NULL pointer dereference will
    occur in the kernel if o2cb_ctl fails in the interval (mkdir,
    o2cb_set_node_attribute(node_num)] in function o2cb_add_node.

    The node num is initialized to 0 in function o2nm_node_group_make_item,
    so o2nm_node_group_drop_item will mistake the node number 0 for a valid
    node number when we delete the node before the node number is set
    correctly. If the local node number of the current host happens to be
    0, cluster->cl_local_node will be set to O2NM_INVALID_NODE_NUM while
    o2hb_thread is still running. The panic stack is generated as follows:

    o2hb_thread
     \-o2hb_do_disk_heartbeat
        \-o2hb_check_own_slot
           |-slot = &reg->hr_slots[o2nm_this_node()];
           //o2nm_this_node() returns O2NM_INVALID_NODE_NUM

    We need to check whether the node number is set when we delete the node.
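
    The idea can be sketched in userspace C as follows. Everything here is
    a simplified assumption: the struct layouts, the num_set flag, the
    sentinel value and the function name are hypothetical and do not mirror
    the o2nm data structures.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_NODE_NUM 255 /* assumed sentinel for "no local node" */

struct sketch_node {
	uint8_t num;     /* starts at 0 when the node item is made */
	bool    num_set; /* hypothetical flag: true once the number attribute is written */
};

struct sketch_cluster {
	uint8_t local_node;
};

/* Drop a node; only a node whose number was really set may clear the local node. */
void sketch_drop_node(struct sketch_cluster *cluster, const struct sketch_node *node)
{
	if (!node->num_set)
		return; /* num is still the default 0, not a real node number */
	if (cluster->local_node == node->num)
		cluster->local_node = SKETCH_INVALID_NODE_NUM;
}
```

    Dropping a half-created node (num still 0, number never set) then
    leaves a local node number of 0 untouched.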

    Link: http://lkml.kernel.org/r/133d8045-72cc-863e-8eae-5013f9f6bc51@huawei.com
    Signed-off-by: Jia Guo
    Reviewed-by: Joseph Qi
    Acked-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jia Guo
     
  • [ Upstream commit 92d1d07daad65c300c7d0b68bbef8867e9895d54 ]

    Kmemleak throws endless warnings during boot due to the following in
    __alloc_alien_cache():

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
    kmemleak_no_scan(ac);

    Kmemleak does not track the array cache (alc->ac) but the alien cache
    (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
    out of init_arraycache().

    There is another place that calls init_arraycache(), but
    alloc_kmem_cache_cpus() uses the percpu allocator, whose allocations
    will never be considered a leak.

    kmemleak: Found object by alias at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    lookup_object+0x84/0xac
    find_and_get_object+0x84/0xe4
    kmemleak_no_scan+0x74/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
    kmemleak: Object 0xffff8007b9aa7e00 (size 256):
    kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
    kmemleak: min_count = 1
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    kmemleak_alloc+0x84/0xb8
    kmem_cache_alloc_node_trace+0x31c/0x3a0
    __kmalloc_node+0x58/0x78
    setup_kmem_cache_node+0x26c/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    kmemleak_no_scan+0x90/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18

    Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
    Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit afd07389d3f4933c7f7817a92fb5e053d59a3182 ]

    One of the vmalloc stress test cases triggers the kernel BUG():


    [60.562151] ------------[ cut here ]------------
    [60.562154] kernel BUG at mm/vmalloc.c:512!
    [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
    [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
    [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390

    It can happen when a big align request overflows the calculated
    address, i.e. it becomes 0 after ALIGN()'s fixup.

    Fix it by checking if calculated address is within vstart/vend range.
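
    The overflow and the range check can be reproduced in userspace with
    the sketch below. The helper names are hypothetical, 64-bit addresses
    are assumed, and the range check is one possible formulation rather
    than the code in alloc_vmap_area().

```c
#include <stdbool.h>
#include <stdint.h>

/* Round addr up to a multiple of align (a power of two); wraps on overflow. */
uint64_t sketch_align_up(uint64_t addr, uint64_t align)
{
	return (addr + (align - 1)) & ~(align - 1);
}

/* True when [addr, addr + size) lies inside [vstart, vend), without overflowing. */
bool sketch_in_range(uint64_t addr, uint64_t size, uint64_t vstart, uint64_t vend)
{
	return addr >= vstart && addr <= vend && size <= vend - addr;
}
```

    For example, aligning 0xffffffffff000000 to 1 << 40 wraps around to 0,
    which sketch_in_range() rejects for any range whose vstart is above 0.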

    Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Uladzislau Rezki (Sony)
     
  • [ Upstream commit 2e25644e8da4ed3a27e7b8315aaae74660be72dc ]

    Syzbot with KMSAN reports (excerpt):

    ==================================================================
    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x173/0x1d0 lib/dump_stack.c:113
    kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
    __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
    mpol_rebind_policy mm/mempolicy.c:353 [inline]
    mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
    update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
    update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
    cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728

    ...

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
    kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
    kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
    kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
    mpol_new mm/mempolicy.c:276 [inline]
    do_mbind mm/mempolicy.c:1180 [inline]
    kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
    __do_sys_mbind mm/mempolicy.c:1354 [inline]

    As it's difficult to report where exactly the uninit value resides in
    the mempolicy object, we have to guess a bit. mm/mempolicy.c:353
    contains this part of mpol_rebind_policy():

    if (!mpol_store_user_nodemask(pol) &&
    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))

    "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
    ever see being uninitialized after leaving mpol_new(). So I'll guess
    it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
    but still part of statement starting on line 353.

    For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
    reachable for a mempolicy where mpol_set_nodemask() is called in
    do_mbind(), it seems the only possibility is an MPOL_PREFERRED policy
    with an empty set of nodes, i.e. the MPOL_LOCAL equivalent, with the
    MPOL_F_LOCAL flag. Let's exclude such policies from the nodes_equal()
    check. Note the uninit access should be benign anyway, as rebinding
    this kind of policy is always a no-op. Therefore there is no actual
    need for stable inclusion.

    Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
    Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Yisheng Xie
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vlastimil Babka
     
  • [ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

    If a memory cgroup contains a single process with many threads
    (including different process groups sharing the mm), then it is
    possible to trigger a race where the oom killer complains in the log
    that there are no oom eligible tasks, which is both annoying and
    confusing because there is no actual problem. The race looks as
    follows:

    P1                              oom_reaper          P2
    try_charge                                          try_charge
      mem_cgroup_out_of_memory
        mutex_lock(oom_lock)
          out_of_memory
            oom_kill_process(P1,P2)
              wake_oom_reaper
        mutex_unlock(oom_lock)
                                    oom_reap_task
                                                        mutex_lock(oom_lock)
                                                          select_bad_process # no victim

    The problem is more visible with many threads.

    Fix this by checking for fatal_signal_pending from
    mem_cgroup_out_of_memory when the oom_lock is already held.

    The oom bypass is safe because we do the same early in the try_charge
    path already. The situation might have changed in the meantime. It
    should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
    but for better code readability, abstract the current charge bypass
    condition into should_force_charge and reuse it from that path.
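
    A userspace sketch of that abstraction is shown below. The task struct
    and both function names are hypothetical, the oom_lock is not modelled,
    and only the two conditions named above are included, so this is an
    illustration of the idea rather than the memcg code.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the current task. */
struct sketch_task {
	bool oom_victim;           /* what tsk_is_oom_victim would report */
	bool fatal_signal_pending; /* what fatal_signal_pending would report */
};

/* The charge bypass condition, abstracted into one helper. */
bool sketch_should_force_charge(const struct sketch_task *t)
{
	return t->oom_victim || t->fatal_signal_pending;
}

/*
 * Meant to be called with the oom lock held (not modelled here). Returns
 * true when the OOM killer should look for a victim, false when the caller
 * is already dying and the charge is simply forced.
 */
bool sketch_needs_victim_selection(const struct sketch_task *current_task)
{
	return !sketch_should_force_charge(current_task);
}
```

    In the race above, P2 already has a fatal signal pending when it gets
    the lock, so in this model it would skip victim selection instead of
    reporting that no eligible task exists.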

    Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     
  • [ Upstream commit d342a0b38674867ea67fde47b0e1e60ffe9f17a2 ]

    Since placing the global init process in a memory cgroup is technically
    possible, oom_kill_memcg_member() must check for it.

    Tasks in /test1 are going to be killed due to memory.oom.group set
    Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int main(int argc, char *argv[])
    {
        static char buffer[10485760];
        static int pipe_fd[2] = { EOF, EOF };
        unsigned int i;
        int fd;
        char buf[64] = { };
        if (pipe(pipe_fd))
            return 1;
        if (chdir("/sys/fs/cgroup/"))
            return 1;
        fd = open("cgroup.subtree_control", O_WRONLY);
        write(fd, "+memory", 7);
        close(fd);
        mkdir("test1", 0755);
        fd = open("test1/memory.oom.group", O_WRONLY);
        write(fd, "1", 1);
        close(fd);
        fd = open("test1/cgroup.procs", O_WRONLY);
        write(fd, "1", 1);
        snprintf(buf, sizeof(buf) - 1, "%d", getpid());
        write(fd, buf, strlen(buf));
        close(fd);
        snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
        fd = open("test1/memory.max", O_WRONLY);
        write(fd, buf, strlen(buf));
        close(fd);
        for (i = 0; i < 10; i++)
            if (fork() == 0) {
                char c;
                close(pipe_fd[1]);
                read(pipe_fd[0], &c, 1);
                memset(buffer, 0, sizeof(buffer));
                sleep(3);
                _exit(0);
            }
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        sleep(3);
        return 0;
    }

    [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.062954][ T9185] Call Trace:
    [ 37.063976][ T9185] dump_stack+0x67/0x95
    [ 37.065263][ T9185] dump_header+0x51/0x570
    [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110
    [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.069967][ T9185] oom_kill_process+0x18d/0x210
    [ 37.071515][ T9185] out_of_memory+0x11b/0x380
    [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0
    [ 37.074601][ T9185] try_charge+0x790/0x820
    [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0
    [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30
    [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0
    [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070
    [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0
    [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0
    [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0
    [ 37.086529][ T9185] do_page_fault+0x28/0x260
    [ 37.087788][ T9185] ? page_fault+0x8/0x30
    [ 37.088978][ T9185] page_fault+0x1e/0x30
    [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
    [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
    [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
    [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
    [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
    [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
    [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
    [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
    [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
    [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
    [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
    [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
    [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
    [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
    [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
    [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
    [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
    [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
    [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.361685][ T1] Call Trace:
    [ 37.363239][ T1] dump_stack+0x67/0x95
    [ 37.365010][ T1] panic+0xfc/0x2b0
    [ 37.366853][ T1] do_exit+0xd55/0xd60
    [ 37.368595][ T1] do_group_exit+0x47/0xc0
    [ 37.370415][ T1] get_signal+0x32a/0x920
    [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.374596][ T1] do_signal+0x32/0x6e0
    [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b
    [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0
    [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b
    [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0
    [ 37.384594][ T1] ? page_fault+0x8/0x30
    [ 37.386453][ T1] retint_user+0x8/0x18
    [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8
    [ 37.389922][ T1] Code: Bad RIP value.
    [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
    [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
    [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
    [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
    [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
    [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0

    Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
    Fixes: 3d8b38eb81cac813 ("mm, oom: introduce memory.oom.group")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Tetsuo Handa
     
  • [ Upstream commit c10d38cc8d3e43f946b6c2bf4602c86791587f30 ]

    Dan Carpenter reports a potential NULL dereference in
    get_swap_page_of_type:

    Smatch complains that the NULL checks on "si" aren't consistent. This
    seems like a real bug because we have not ensured that the type is
    valid and so "si" can be NULL.

    Add the missing check for NULL, taking care to use a read barrier to
    ensure CPU1 observes CPU0's updates in the correct order:

    CPU0                           CPU1
    alloc_swap_info()              if (type >= nr_swapfiles)
      swap_info[type] = p              /* handle invalid entry */
      smp_wmb()                    smp_rmb()
      ++nr_swapfiles               p = swap_info[type]

    Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
    CPU0's write to swap_info[type] and read NULL from swap_info[type].

    Ying Huang noticed other places in swapfile.c don't order these reads
    properly. Introduce swap_type_to_swap_info to encourage correct usage.

    Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
    (see tools/memory-model/Documentation/explanation.txt).

    This ordering need not be enforced in places where swap_lock is held
    (e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
    and the swap_info array.
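
    The publish/consume ordering above can be sketched in userspace with
    C11 atomics. This is only an analogy: the fences stand in for
    smp_wmb()/smp_rmb(), the "_model" names are hypothetical, and a
    single writer is assumed (the kernel serializes writers with
    swap_lock).

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_SWAPFILES_MODEL 4

struct swap_info_model { int prio; };

static struct swap_info_model *_Atomic swap_info_tbl[MAX_SWAPFILES_MODEL];
static atomic_uint nr_swapfiles_model;

/* Writer: fill the slot first, then make it visible via the count. */
static int alloc_swap_info_model(struct swap_info_model *p)
{
    unsigned int type = atomic_load_explicit(&nr_swapfiles_model,
                                             memory_order_relaxed);

    if (type >= MAX_SWAPFILES_MODEL)
        return -1;
    atomic_store_explicit(&swap_info_tbl[type], p, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* role of smp_wmb() */
    atomic_store_explicit(&nr_swapfiles_model, type + 1,
                          memory_order_relaxed);
    return (int)type;
}

/* Reader: check the count first, then load the slot. */
static struct swap_info_model *swap_type_to_swap_info_model(unsigned int type)
{
    if (type >= atomic_load_explicit(&nr_swapfiles_model,
                                     memory_order_relaxed))
        return NULL;  /* handle invalid entry */
    atomic_thread_fence(memory_order_acquire);  /* role of smp_rmb() */
    return atomic_load_explicit(&swap_info_tbl[type], memory_order_relaxed);
}
```

    Funnelling every lookup through one helper keeps the bounds check and
    the barrier together, which is the point of introducing
    swap_type_to_swap_info.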

    Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
    Fixes: ec8acf20afb8 ("swap: add per-partition lock for swapfile")
    Signed-off-by: Daniel Jordan
    Reported-by: Dan Carpenter
    Suggested-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Acked-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Omar Sandoval
    Cc: Paul McKenney
    Cc: Shaohua Li
    Cc: Stephen Rothwell
    Cc: Tejun Heo
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Daniel Jordan
     
  • [ Upstream commit 0c81585499601acd1d0e1cbf424cabfaee60628c ]

    After offlining a memory block, kmemleak scan will trigger a crash, as
    it encounters a page ext address that has already been freed during
    memory offlining. At the beginning in alloc_page_ext(), it calls
    kmemleak_alloc(), but it does not call kmemleak_free() in
    free_page_ext().

    BUG: unable to handle kernel paging request at ffff888453d00000
    PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800ffffbac2ff060
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
    Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
    RIP: 0010:scan_block+0xb5/0x290
    Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 8b 3b e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
    RSP: 0018:ffff8881ec57f8e0 EFLAGS: 00010082
    RAX: 0000000000000000 RBX: ffff888453d00000 RCX: ffffffffa61e5a54
    RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888453d00000
    RBP: ffff8881ec57f920 R08: fffffbfff4ed588d R09: fffffbfff4ed588c
    R10: fffffbfff4ed588c R11: ffffffffa76ac463 R12: dffffc0000000000
    R13: ffff888453d00ff9 R14: ffff8881f80cef48 R15: ffff8881f80cef48
    FS: 00007f6c0e3f8740(0000) GS:ffff8881f7680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888453d00000 CR3: 00000001c4244003 CR4: 00000000001606a0
    Call Trace:
    scan_gray_list+0x269/0x430
    kmemleak_scan+0x5a8/0x10f0
    kmemleak_write+0x541/0x6ca
    full_proxy_write+0xf8/0x190
    __vfs_write+0xeb/0x980
    vfs_write+0x15a/0x4f0
    ksys_write+0xd2/0x1b0
    __x64_sys_write+0x73/0xb0
    do_syscall_64+0xeb/0xaaa
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f6c0dad73b8
    Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
    RSP: 002b:00007ffd5b863cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f6c0dad73b8
    RDX: 0000000000000005 RSI: 000055a9216e1710 RDI: 0000000000000001
    RBP: 000055a9216e1710 R08: 000000000000000a R09: 00007ffd5b863840
    R10: 000000000000000a R11: 0000000000000246 R12: 00007f6c0dda9780
    R13: 0000000000000005 R14: 00007f6c0dda4740 R15: 0000000000000005
    Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod efivarfs
    CR2: ffff888453d00000
    ---[ end trace ccf646c7456717c5 ]---
    Kernel panic - not syncing: Fatal exception
    Shutting down cpus with NMI
    Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range:
    0xffffffff80000000-0xffffffffbfffffff)
    ---[ end Kernel panic - not syncing: Fatal exception ]---

    Link: http://lkml.kernel.org/r/20190227173147.75650-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 0d3bd18a5efd66097ef58622b898d3139790aa9d ]

    If cma_init_reserved_mem fails, the memblock allocated by
    memblock_reserve or memblock_alloc_range needs to be freed.

    Quote Catalin's comments:
    https://lkml.org/lkml/2019/2/26/482

    Kmemleak is supposed to work with the memblock_{alloc,free} pair and it
    ignores the memblock_reserve() as a memblock_alloc() implementation
    detail. It is, however, tolerant to memblock_free() being called on
    a sub-range or just a different range from a previous memblock_alloc().
    So the original patch looks fine to me. FWIW:

    Link: http://lkml.kernel.org/r/20190227144631.16708-1-peng.fan@nxp.com
    Signed-off-by: Peng Fan
    Reviewed-by: Catalin Marinas
    Reviewed-by: Mike Rapoport
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Marek Szyprowski
    Cc: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Peng Fan
     
  • [ Upstream commit d778015ac95bc036af73342c878ab19250e01fe1 ]

    next_present_section_nr() returns an unsigned value, so its only
    "not found" result is -1 converted to unsigned. Check for that value
    explicitly; the compiler converts the -1 to unsigned where needed.
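
    A self-contained sketch of the pattern, using a hypothetical presence
    map and "_model" names rather than the real mm/sparse.c definitions:
    the loop terminates on an explicit comparison with -1 instead of the
    always-true ">= 0" test.

```c
#include <stddef.h>

#define NR_SECTIONS_MODEL 8

/* Hypothetical presence map: sections 1, 2 and 5 are present. */
static const unsigned char present_model[NR_SECTIONS_MODEL] = {
    0, 1, 1, 0, 0, 1, 0, 0
};

/* Next present section after section_nr, or (unsigned long)-1 if none. */
static unsigned long next_present_section_nr_model(unsigned long section_nr)
{
    /* Passing (unsigned long)-1 wraps to 0 and starts from the beginning. */
    for (section_nr++; section_nr < NR_SECTIONS_MODEL; section_nr++)
        if (present_model[section_nr])
            return section_nr;
    return (unsigned long)-1;
}

/* Compare against -1 explicitly; "nr >= 0" would always be true. */
#define for_each_present_section_nr_model(start, nr)               \
    for ((nr) = next_present_section_nr_model((start) - 1);        \
         (nr) != (unsigned long)-1;                                \
         (nr) = next_present_section_nr_model(nr))

static size_t count_present_sections_model(unsigned long start)
{
    unsigned long nr;
    size_t count = 0;

    for_each_present_section_nr_model(start, nr)
        count++;
    return count;
}
```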

    mm/sparse.c: In function 'sparse_init_nid':
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:478:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:497:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/sparse.c: In function 'sparse_init':
    mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    ((section_nr >= 0) && \
    ^~
    mm/sparse.c:520:2: note: in expansion of macro
    'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin + 1, pnum_end) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~

    Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
    Fixes: c4e1be9ec113 ("mm, sparsemem: break out of loops early")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit e34c940245437f36d2c492edd1f8237eff391064 ]

    Ravi Bangoria reported that we fail with an empty NUMA node with the
    following message:

    $ lscpu
    NUMA node0 CPU(s):
    NUMA node1 CPU(s): 0-4

    $ sudo ./perf c2c report
    node/cpu topology bugFailed setup nodes

    Fix this by detecting the empty node and keeping its CPU set empty.

    Reported-by: Nageswara R Sastry
    Signed-off-by: Jiri Olsa
    Tested-by: Ravi Bangoria
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Jonas Rabenstein
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20190305152536.21035-2-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Sasha Levin

    Jiri Olsa
     
  • [ Upstream commit 179fb36abb097976997f50733d5b122a29158cba ]

    After commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments"),
    kexec fails with a kernel panic:

    kexec_core: Starting new kernel
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
    RIP: 0010:0xffffc9000001d000

    Call Trace:
    ? __send_ipi_mask+0x1c6/0x2d0
    ? hv_send_ipi_mask_allbutself+0x6d/0xb0
    ? mp_save_irq+0x70/0x70
    ? __ioapic_read_entry+0x32/0x50
    ? ioapic_read_entry+0x39/0x50
    ? clear_IO_APIC_pin+0xb8/0x110
    ? native_stop_other_cpus+0x6e/0x170
    ? native_machine_shutdown+0x22/0x40
    ? kernel_kexec+0x136/0x156

    That happens if hypercall based IPIs are used because the hypercall page is
    reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
    which invokes the hypercall and dereferences the unusable page.

    To fix this, reset hv_hypercall_pg to NULL before the page is reset to
    avoid any misuse; IPI sending will then fall back to the non-hypercall
    based method. This only happens on kexec / kdump, so just setting the
    pointer to NULL is good enough.
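
    Roughly, the ordering looks like the following userspace model. The
    "_model" names are hypothetical and the actual page reset is elided;
    the sketch only shows that the pointer is cleared first, so a late
    IPI takes the fallback path instead of touching an unusable page.

```c
#include <stddef.h>

enum ipi_path { IPI_VIA_HYPERCALL, IPI_VIA_FALLBACK };

/* Models hv_hypercall_pg: non-NULL while the hypercall page is usable. */
static void *hv_hypercall_pg_model;

/* Models the IPI send path: use the hypercall only if the page is set. */
static enum ipi_path send_ipi_model(void)
{
    if (hv_hypercall_pg_model == NULL)
        return IPI_VIA_FALLBACK;
    return IPI_VIA_HYPERCALL;
}

/* Models the kexec-time cleanup. */
static void hyperv_cleanup_model(void)
{
    /* Clear the pointer before the page itself is reset. */
    hv_hypercall_pg_model = NULL;
    /* ... the hypercall page would be reset here ... */
}
```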

    Fixes: 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments")
    Signed-off-by: Kairui Song
    Signed-off-by: Thomas Gleixner
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Sasha Levin
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Vitaly Kuznetsov
    Cc: Dave Young
    Cc: devel@linuxdriverproject.org
    Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
    Signed-off-by: Sasha Levin

    Kairui Song
     
  • [ Upstream commit e0f0ae838a25464179d37f355d763f9ec139fc15 ]

    The pm8xxx_get_channel() implementation is unclear, and causes gcc to
    suddenly generate odd warnings. The trigger for the warning (at least
    for me) was the entirely unrelated commit 79a4e91d1bb2 ("device.h: Add
    __cold to dev_ logging functions"), which apparently changes gcc
    code generation in the caller function enough to cause this:

    drivers/iio/adc/qcom-pm8xxx-xoadc.c: In function ‘pm8xxx_xoadc_probe’:
    drivers/iio/adc/qcom-pm8xxx-xoadc.c:633:8: warning: ‘ch’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    ret = pm8xxx_read_channel_rsv(adc, ch, AMUX_RSV4,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    &read_nomux_rsv4, true);
    ~~~~~~~~~~~~~~~~~~~~~~~
    drivers/iio/adc/qcom-pm8xxx-xoadc.c:426:27: note: ‘ch’ was declared here
    struct pm8xxx_chan_info *ch;
    ^~

    because gcc for some reason then isn't able to see that the termination
    condition for the "for( )" loop in that function is also the condition
    for returning NULL.

    So it's not _actually_ uninitialized, but the function is admittedly
    just unnecessarily oddly written.

    Simplify and clarify the function, making gcc also see that it always
    returns a valid initialized value.
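
    For illustration, a lookup written in the simpler shape might look
    like this. The structs are reduced, hypothetical models of the driver
    types; the point is that every path returns either a matching element
    or NULL, which is easy for the compiler to follow.

```c
#include <stddef.h>

/* Reduced, hypothetical models of the driver's channel structures. */
struct chan_model { unsigned char amux_channel; };

struct adc_model {
    struct chan_model *chans;
    size_t nchans;
};

/* Return the channel with the given mux number, or NULL if absent. */
static struct chan_model *get_channel_model(struct adc_model *adc,
                                            unsigned char chan)
{
    for (size_t i = 0; i < adc->nchans; i++) {
        struct chan_model *ch = &adc->chans[i];

        if (ch->amux_channel == chan)
            return ch;
    }
    return NULL;
}
```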

    Cc: Joe Perches
    Cc: Greg Kroah-Hartman
    Cc: Andy Gross
    Cc: David Brown
    Cc: Jonathan Cameron
    Cc: Hartmut Knaack
    Cc: Lars-Peter Clausen
    Cc: Peter Meerwald-Stadler
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
     
  • [ Upstream commit 4790595723d4b833b18c994973d39f9efb842887 ]

    Internal IO and SMP IO each have a time-out timer. The timer handler
    checks whether the IO is done according to the flag
    task->task_state_lock.

    There is an issue which may cause the system to hang: an internal IO or
    SMP IO is sent, but because of a hardware exception (such as an injected
    2-bit ECC error) the IO neither completes nor times out. Meanwhile, a
    SAS controller reset occurs to recover the system. The reset releases
    the resources and sets the status of the IO to SAS_TASK_STATE_DONE, so
    when the IO later times out, its completion is never completed and the
    waiter waits forever.

    [ 729.123632] Call trace:
    [ 729.126791] [] __switch_to+0x94/0xa8
    [ 729.133106] [] __schedule+0x1e8/0x7fc
    [ 729.138975] [] schedule+0x34/0x8c
    [ 729.144401] [] schedule_timeout+0x1d8/0x3cc
    [ 729.150690] [] wait_for_common+0xdc/0x1a0
    [ 729.157101] [] wait_for_completion+0x28/0x34
    [ 729.165973] [] hisi_sas_internal_task_abort+0x2a0/0x424 [hisi_sas_test_main]
    [ 729.176447] [] hisi_sas_abort_task+0x244/0x2d8 [hisi_sas_test_main]
    [ 729.185258] [] sas_eh_handle_sas_errors+0x1c8/0x7b8
    [ 729.192391] [] sas_scsi_recover_host+0x130/0x398
    [ 729.199237] [] scsi_error_handler+0x148/0x5c0
    [ 729.206009] [] kthread+0x10c/0x138
    [ 729.211563] [] ret_from_fork+0x10/0x18

    To solve the issue, the task_done callback of those IOs needs to be
    called on SAS controller reset.

    Signed-off-by: Xiang Chen
    Signed-off-by: John Garry
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    Xiang Chen
     
  • [ Upstream commit efdcad62e7b8a02fcccc5ccca57806dce1482ac8 ]

    When the PHY comes down, we currently do not set the negotiated linkrate:

    root@(none)$ pwd
    /sys/class/sas_phy/phy-0:0
    root@(none)$ more enable
    1
    root@(none)$ more negotiated_linkrate
    12.0 Gbit
    root@(none)$ echo 0 > enable
    root@(none)$ more negotiated_linkrate
    12.0 Gbit
    root@(none)$

    This patch fixes the driver code to set it properly when the PHY comes
    down.

    If the PHY had been enabled, then set unknown; otherwise, flag as disabled.

    The logical place to set the negotiated linkrate for this scenario is PHY
    down routine, which is called from the PHY down ISR.

    However, it is not possible to know if the PHY comes down due to PHY
    disable or loss of link, as sas_phy.enabled member is not set until after
    the transport disable routine is complete, which races with the PHY down
    ISR.

    As an imperfect solution, use sas_phy_data.enable as the flag to know if
    the PHY is down due to disable. It's imperfect, as sas_phy_data is internal
    to libsas.

    I can't see another way without adding a new field to hisi_sas_phy and
    managing it, or changing SCSI SAS transport.
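
    The rule itself is small. A hedged model, with hypothetical enum and
    struct names standing in for the libsas ones:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the link rate values involved. */
enum linkrate_model {
    LINKRATE_UNKNOWN_MODEL,
    LINKRATE_PHY_DISABLED_MODEL,
    LINKRATE_12_0_GBPS_MODEL,
};

struct phy_model {
    bool enable;  /* models the sas_phy_data.enable flag */
    enum linkrate_model negotiated_linkrate;
};

/* Models the PHY down routine called from the PHY down ISR. */
static void phy_down_model(struct phy_model *phy)
{
    /* Still enabled means link loss; otherwise the PHY was disabled. */
    phy->negotiated_linkrate = phy->enable ? LINKRATE_UNKNOWN_MODEL
                                           : LINKRATE_PHY_DISABLED_MODEL;
}
```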

    Signed-off-by: John Garry
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Sasha Levin

    John Garry
     
  • [ Upstream commit 8e2688876c7f7073d925e1f150e86b8ed3338f52 ]

    libbpf targets don't explicitly depend on the fixdep target, so when
    we do 'make -j$(nproc)', there is a high probability that some
    objects will be built before the fixdep binary is available.

    Fix this by running sub-make; this makes sure that fixdep dependency
    is properly accounted for.

    For the same issue in perf, see commit abb26210a395 ("perf tools: Force
    fixdep compilation at the start of the build").

    Before:

    $ rm -rf /tmp/bld; mkdir /tmp/bld; make -j$(nproc) O=/tmp/bld -C tools/lib/bpf/

    Auto-detecting system features:
    ... libelf: [ on ]
    ... bpf: [ on ]

    HOSTCC /tmp/bld/fixdep.o
    CC /tmp/bld/libbpf.o
    CC /tmp/bld/bpf.o
    CC /tmp/bld/btf.o
    CC /tmp/bld/nlattr.o
    CC /tmp/bld/libbpf_errno.o
    CC /tmp/bld/str_error.o
    CC /tmp/bld/netlink.o
    CC /tmp/bld/bpf_prog_linfo.o
    CC /tmp/bld/libbpf_probes.o
    CC /tmp/bld/xsk.o
    HOSTLD /tmp/bld/fixdep-in.o
    LINK /tmp/bld/fixdep
    LD /tmp/bld/libbpf-in.o
    LINK /tmp/bld/libbpf.a
    LINK /tmp/bld/libbpf.so
    LINK /tmp/bld/test_libbpf

    $ head /tmp/bld/.libbpf.o.cmd
    # cannot find fixdep (/usr/local/google/home/sdf/src/linux/xxx//fixdep)
    # using basic dep data

    /tmp/bld/libbpf.o: libbpf.c /usr/include/stdc-predef.h \
    /usr/include/stdlib.h /usr/include/features.h \
    /usr/include/x86_64-linux-gnu/sys/cdefs.h \
    /usr/include/x86_64-linux-gnu/bits/wordsize.h \
    /usr/include/x86_64-linux-gnu/gnu/stubs.h \
    /usr/include/x86_64-linux-gnu/gnu/stubs-64.h \
    /usr/lib/gcc/x86_64-linux-gnu/7/include/stddef.h \

    After:

    $ rm -rf /tmp/bld; mkdir /tmp/bld; make -j$(nproc) O=/tmp/bld -C tools/lib/bpf/

    Auto-detecting system features:
    ... libelf: [ on ]
    ... bpf: [ on ]

    HOSTCC /tmp/bld/fixdep.o
    HOSTLD /tmp/bld/fixdep-in.o
    LINK /tmp/bld/fixdep
    CC /tmp/bld/libbpf.o
    CC /tmp/bld/bpf.o
    CC /tmp/bld/nlattr.o
    CC /tmp/bld/btf.o
    CC /tmp/bld/libbpf_errno.o
    CC /tmp/bld/str_error.o
    CC /tmp/bld/netlink.o
    CC /tmp/bld/bpf_prog_linfo.o
    CC /tmp/bld/libbpf_probes.o
    CC /tmp/bld/xsk.o
    LD /tmp/bld/libbpf-in.o
    LINK /tmp/bld/libbpf.a
    LINK /tmp/bld/libbpf.so
    LINK /tmp/bld/test_libbpf

    $ head /tmp/bld/.libbpf.o.cmd
    cmd_/tmp/bld/libbpf.o := gcc -Wp,-MD,/tmp/bld/.libbpf.o.d -Wp,-MT,/tmp/bld/libbpf.o -g -Wall -DHAVE_LIBELF_MMAP_SUPPORT -DCOMPAT_NEED_REALLOCARRAY -Wbad-function-cast -Wdeclaration-after-statement -Wformat-security -Wformat-y2k -Winit-self -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked -Wredundant-decls -Wshadow -Wstrict-prototypes -Wswitch-default -Wswitch-enum -Wundef -Wwrite-strings -Wformat -Wstrict-aliasing=3 -Werror -Wall -fPIC -I. -I/usr/local/google/home/sdf/src/linux/tools/include -I/usr/local/google/home/sdf/src/linux/tools/arch/x86/include/uapi -I/usr/local/google/home/sdf/src/linux/tools/include/uapi -fvisibility=hidden -D"BUILD_STR(s)=$(pound)s" -c -o /tmp/bld/libbpf.o libbpf.c

    source_/tmp/bld/libbpf.o := libbpf.c

    deps_/tmp/bld/libbpf.o := \
    /usr/include/stdc-predef.h \
    /usr/include/stdlib.h \
    /usr/include/features.h \
    /usr/include/x86_64-linux-gnu/sys/cdefs.h \
    /usr/include/x86_64-linux-gnu/bits/wordsize.h \

    Fixes: 7c422f557266 ("tools build: Build fixdep helper from perf and basic libs")
    Reported-by: Eric Dumazet
    Signed-off-by: Stanislav Fomichev
    Acked-by: Yonghong Song
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Stanislav Fomichev
     
  • [ Upstream commit 43d281662fdb46750d49417559b71069f435298d ]

    The enic driver relies on the CONFIG_CPUMASK_OFFSTACK feature to
    dynamically allocate a struct member, but this is normally intended for
    local variables.

    Building with clang, I get a warning for a few locations that check the
    address of the cpumask_var_t:

    drivers/net/ethernet/cisco/enic/enic_main.c:122:22: error: address of array 'enic->msix[i].affinity_mask' will always evaluate to 'true' [-Werror,-Wpointer-bool-conversion]

    As far as I can tell, the code is still correct, as the truth value of
    the pointer is what we need in this configuration. To get rid of
    the warning, use cpumask_available() instead of checking the
    pointer directly.
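
    Why the raw pointer test trips the warning can be seen in a small
    userspace model. The "_model" names and the CPUMASK_OFFSTACK_MODEL
    switch are hypothetical; the sketch assumes the mask type is a
    one-element array in the default configuration and a pointer when
    masks are allocated off-stack.

```c
#include <stdbool.h>
#include <stddef.h>

struct cpumask_model { unsigned long bits[1]; };

#ifdef CPUMASK_OFFSTACK_MODEL
/* Off-stack: the variable is a pointer that may still be NULL. */
typedef struct cpumask_model *cpumask_var_model_t;

static bool cpumask_available_model(cpumask_var_model_t mask)
{
    return mask != NULL;
}
#else
/* Default: the variable is an embedded array, so its address is never NULL. */
typedef struct cpumask_model cpumask_var_model_t[1];

static bool cpumask_available_model(cpumask_var_model_t mask)
{
    (void)mask;
    return true;
}
#endif
```

    Callers ask the helper instead of testing the variable, so the same
    source compiles cleanly in both configurations.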

    Fixes: 322cf7e3a4e8 ("enic: assign affinity hint to interrupts")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Nathan Chancellor
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Arnd Bergmann
     
  • [ Upstream commit df103170854e87124ee7bdd2bca64b178e653f97 ]

    When building with -Wsometimes-uninitialized, Clang warns:

    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:495:3: warning: variable 'ns' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:495:3: warning: variable 'ns' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:532:3: warning: variable 'ns' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:532:3: warning: variable 'ns' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:741:3: warning: variable 'sec_inc' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:741:3: warning: variable 'sec_inc' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]

    Clang is concerned with the use of stmmac_do_void_callback (which
    stmmac_get_timestamp and stmmac_config_sub_second_increment wrap),
    as it may fail to initialize these values if the if condition was ever
    false (meaning the callbacks don't exist). It's not wrong because the
    callbacks (get_timestamp and config_sub_second_increment respectively)
    are the ones that initialize the variables. While it's unlikely that the
    callbacks are ever going to disappear and make that condition false, we
    can easily avoid this warning by zero-initializing the variables.

    Link: https://github.com/ClangBuiltLinux/linux/issues/384
    Suggested-by: Nick Desaulniers
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Nathan Chancellor
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Nathan Chancellor
     
  • [ Upstream commit 32a5ad9c22852e6bd9e74bdec5934ef9d1480bc5 ]

    Currently, when writing

    echo 18446744073709551616 > /proc/sys/fs/file-max

    /proc/sys/fs/file-max will overflow and be set to 0. That quickly
    crashes the system.

    This commit sets the max and min value for file-max. The max value is
    set to the maximum of long int. Any higher value cannot currently be
    used as the percpu counters are long ints and not unsigned integers.

    Note that the file-max value is ultimately parsed via
    __do_proc_doulongvec_minmax(). This function does not report an error
    when min or max are exceeded. This means that if a value larger than
    long int is written, userspace will not receive an error; instead the
    old value will be kept. There is an argument to be made that this should
    be changed and __do_proc_doulongvec_minmax() should return an error when
    a dedicated min or max value is exceeded. However, this has the
    potential to break userspace, so let's defer this to an RFC patch.
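
    The "keep the old value, report nothing" behaviour can be modelled in
    a few lines. The function name is hypothetical and the sketch covers
    only the range check, not the actual sysctl parsing.

```c
#include <limits.h>

/*
 * Model of a min/max bounded write: an out-of-range value is dropped
 * silently and the previous value is kept. Returns 0 in both cases,
 * mirroring the lack of an error described above.
 */
static int write_ulong_minmax_model(unsigned long *slot, unsigned long val,
                                    unsigned long min, unsigned long max)
{
    if (val < min || val > max)
        return 0;  /* out of range: keep the old value */
    *slot = val;
    return 0;
}
```

    With max set to LONG_MAX, writing ULONG_MAX leaves the previous limit
    in place instead of storing an overflowed value.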

    Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
    Signed-off-by: Christian Brauner
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Joe Lawrence
    Cc: Luis Chamberlain
    Cc: Waiman Long
    [christian@brauner.io: v4]
    Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Christian Brauner
     
  • [ Upstream commit 62461ac2e5b6520b6d65fc6d7d7b4b8df4b848d8 ]

    The percpu member of this structure is declared as:
    struct ... ** __percpu member;
    So its type is:
    __percpu pointer to pointer to struct ...

    But looking at how it's used, its type should be:
    pointer to __percpu pointer to struct ...
    and it should thus be declared as:
    struct ... * __percpu *member;

    So fix the placement of '__percpu' in the definition of this
    structure.

    This silences a few Sparse warnings like:
    warning: incorrect type in initializer (different address spaces)
    expected void const [noderef] *__vpp_verify
    got struct sched_domain **
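
    The placement rule is the same one C applies to 'const'. A minimal
    sketch, assuming an empty __percpu definition so it builds outside
    Sparse; struct item and both holder structs are hypothetical and are
    not the structure touched by this commit.

```c
/* Outside of Sparse the annotation expands to nothing. This empty
 * definition is an assumption so the sketch builds in user space. */
#define __percpu

struct item {
    int value;
};

/* Before: the annotation qualifies the member itself, giving
 * "__percpu pointer to pointer to struct item".
 * Compare: "int ** const p" is a const pointer to pointer to int. */
struct holder_before {
    struct item ** __percpu slot;
};

/* After: "pointer to __percpu pointer to struct item".
 * Compare: "int * const *p" is a pointer to a const pointer to int. */
struct holder_after {
    struct item * __percpu *slot;
};
```

    Both structs have the same layout for the compiler. Only the address
    space that Sparse attaches to each pointer level differs.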

    Link: http://lkml.kernel.org/r/20190118144902.79065-1-luc.vanoostenryck@gmail.com
    Fixes: 017c59c042d01 ("relay: Use per CPU constructs for the relay channel buffer pointers")
    Signed-off-by: Luc Van Oostenryck
    Cc: Jens Axboe
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Luc Van Oostenryck
     
  • [ Upstream commit d01849f7deba81f4959fd9e51bf20dbf46987d1c ]

    Tony notes that the GPIO module does not idle when level interrupts are
    in use, as the wakeup appears to get stuck.

    After extensive investigation, it appears that the wakeup will only be
    cleared if the interrupt status register is cleared while the interrupt
    is enabled. However, we are currently clearing it with the interrupt
    disabled for level-based interrupts.

    It is acknowledged that this observed behaviour conflicts with a
    statement in the TRM:

    CAUTION
    After servicing the interrupt, the status bit in the interrupt status
    register (GPIOi.GPIO_IRQSTATUS_0 or GPIOi.GPIO_IRQSTATUS_1) must be
    reset and the interrupt line released (by setting the corresponding
    bit of the interrupt status register to 1) before enabling an
    interrupt for the GPIO channel in the interrupt-enable register
    (GPIOi.GPIO_IRQSTATUS_SET_0 or GPIOi.GPIO_IRQSTATUS_SET_1) to prevent
    the occurrence of unexpected interrupts when enabling an interrupt
    for the GPIO channel.

    However, this does not appear to be a practical problem.

    Further, as reported by Grygorii Strashko,
    the TI Android kernel tree has an earlier similar patch as "GPIO: OMAP:
    Fix the sequence to clear the IRQ status" saying:

    if the status is cleared after disabling the IRQ then sWAKEUP will not
    be cleared and gates the module transition

    When we unmask the level interrupt after the interrupt has been handled,
    enable the interrupt and only then clear the interrupt. If the interrupt
    is still pending, the hardware will re-assert the interrupt status.

    Should the caution note in the TRM prove to be a problem, we could
    use a clear-enable-clear sequence instead.
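
    The ordering argument can be illustrated with a toy model. This is a
    sketch only: struct gpio_model and its helpers are hypothetical, and
    the rule "clearing status releases the wakeup only while the interrupt
    is enabled" is the behaviour observed above, not documented hardware
    semantics.

```c
#include <stdbool.h>

struct gpio_model {
    bool irq_enabled;  /* interrupt-enable bit for one GPIO line */
    bool irq_status;   /* latched interrupt status bit */
    bool wakeup_stuck; /* wakeup latch that keeps the module from idling */
};

/* A level interrupt fires: status and wakeup are both latched. */
void model_raise(struct gpio_model *g)
{
    g->irq_status = true;
    g->wakeup_stuck = true;
}

/* Assumed behaviour: the wakeup is released only if the interrupt
 * is enabled at the moment the status bit is cleared. */
void model_clear_status(struct gpio_model *g)
{
    g->irq_status = false;
    if (g->irq_enabled)
        g->wakeup_stuck = false;
}

/* Old unmask order: clear first, then enable. Returns wakeup state. */
bool unmask_clear_then_enable(struct gpio_model *g)
{
    model_clear_status(g);
    g->irq_enabled = true;
    return g->wakeup_stuck;
}

/* New unmask order: enable first, then clear. Returns wakeup state. */
bool unmask_enable_then_clear(struct gpio_model *g)
{
    g->irq_enabled = true;
    model_clear_status(g);
    return g->wakeup_stuck;
}
```

    In this model the old order leaves the wakeup latched after a level
    interrupt has been raised with the line masked, while the new order
    releases it.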

    Cc: Aaro Koskinen
    Cc: Keerthy
    Cc: Peter Ujfalusi
    Signed-off-by: Russell King
    [tony@atomide.com: updated comments based on an earlier TI patch]
    Signed-off-by: Tony Lindgren
    Acked-by: Grygorii Strashko
    Signed-off-by: Linus Walleij
    Signed-off-by: Sasha Levin

    Russell King
     
  • [ Upstream commit 6e77c413e8e73d0f36b5358b601389d75ec4451c ]

    If we try to set a VF's MAC address on a VF (not PF) net device,
    the kernel will crash. The commands are shown below:

    $ echo 2 > /sys/class/net/$MLX_PF0/device/sriov_numvfs
    $ ip link set $MLX_VF0 vf 0 mac 00:11:22:33:44:00

    [exception RIP: mlx5_eswitch_set_vport_mac+41]
    [ffffb8b7079e3688] do_setlink at ffffffff8f67f85b
    [ffffb8b7079e37a8] __rtnl_newlink at ffffffff8f683778
    [ffffb8b7079e3b68] rtnl_newlink at ffffffff8f683a63
    [ffffb8b7079e3b90] rtnetlink_rcv_msg at ffffffff8f67d812
    [ffffb8b7079e3c10] netlink_rcv_skb at ffffffff8f6b88ab
    [ffffb8b7079e3c60] netlink_unicast at ffffffff8f6b808f
    [ffffb8b7079e3ca0] netlink_sendmsg at ffffffff8f6b8412
    [ffffb8b7079e3d18] sock_sendmsg at ffffffff8f6452f6
    [ffffb8b7079e3d30] ___sys_sendmsg at ffffffff8f645860
    [ffffb8b7079e3eb0] __sys_sendmsg at ffffffff8f647a38
    [ffffb8b7079e3f38] do_syscall_64 at ffffffff8f00401b
    [ffffb8b7079e3f50] entry_SYSCALL_64_after_hwframe at ffffffff8f80008c

    and

    [exception RIP: mlx5_eswitch_get_vport_config+12]
    [ffffa70607e57678] mlx5e_get_vf_config at ffffffffc03c7f8f [mlx5_core]
    [ffffa70607e57688] do_setlink at ffffffffbc67fa59
    [ffffa70607e577a8] __rtnl_newlink at ffffffffbc683778
    [ffffa70607e57b68] rtnl_newlink at ffffffffbc683a63
    [ffffa70607e57b90] rtnetlink_rcv_msg at ffffffffbc67d812
    [ffffa70607e57c10] netlink_rcv_skb at ffffffffbc6b88ab
    [ffffa70607e57c60] netlink_unicast at ffffffffbc6b808f
    [ffffa70607e57ca0] netlink_sendmsg at ffffffffbc6b8412
    [ffffa70607e57d18] sock_sendmsg at ffffffffbc6452f6
    [ffffa70607e57d30] ___sys_sendmsg at ffffffffbc645860
    [ffffa70607e57eb0] __sys_sendmsg at ffffffffbc647a38
    [ffffa70607e57f38] do_syscall_64 at ffffffffbc00401b
    [ffffa70607e57f50] entry_SYSCALL_64_after_hwframe at ffffffffbc80008c

    Fixes: a8d70a054a718 ("net/mlx5: E-Switch, Disallow vlan/spoofcheck setup if not being esw manager")
    Cc: Eli Cohen
    Signed-off-by: Tonghao Zhang
    Reviewed-by: Roi Dayan
    Acked-by: Saeed Mahameed
    Signed-off-by: Saeed Mahameed
    Signed-off-by: Sasha Levin

    Tonghao Zhang
     
  • [ Upstream commit 24319258660a84dd77f4be026a55b10a12524919 ]

    If we try to set a VF's rate on a VF (not PF) net device, the kernel
    will crash. The commands are shown below:

    $ echo 2 > /sys/class/net/$MLX_PF0/device/sriov_numvfs
    $ ip link set $MLX_VF0 vf 0 max_tx_rate 2 min_tx_rate 1

    If the first patch ("net/mlx5: Avoid panic when setting
    vport mac, getting vport config") is not applied, the command:

    $ ip link set $MLX_VF0 vf 0 rate 100

    can also crash the kernel.

    [ 1650.006388] RIP: 0010:mlx5_eswitch_set_vport_rate+0x1f/0x260 [mlx5_core]
    [ 1650.007092] do_setlink+0x982/0xd20
    [ 1650.007129] __rtnl_newlink+0x528/0x7d0
    [ 1650.007374] rtnl_newlink+0x43/0x60
    [ 1650.007407] rtnetlink_rcv_msg+0x2a2/0x320
    [ 1650.007484] netlink_rcv_skb+0xcb/0x100
    [ 1650.007519] netlink_unicast+0x17f/0x230
    [ 1650.007554] netlink_sendmsg+0x2d2/0x3d0
    [ 1650.007592] sock_sendmsg+0x36/0x50
    [ 1650.007625] ___sys_sendmsg+0x280/0x2a0
    [ 1650.007963] __sys_sendmsg+0x58/0xa0
    [ 1650.007998] do_syscall_64+0x5b/0x180
    [ 1650.009438] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate")
    Cc: Mohamad Haj Yahia
    Signed-off-by: Tonghao Zhang
    Reviewed-by: Roi Dayan
    Acked-by: Saeed Mahameed
    Signed-off-by: Saeed Mahameed
    Signed-off-by: Sasha Levin

    Tonghao Zhang
     
  • [ Upstream commit 31b265b3baaf55f209229888b7ffea523ddab366 ]

    As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
    BUG for "sleeping function called from invalid context".

    kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
    atomic context. A very simple solution for this is to add allocation
    flags to ring_buffer_read_prepare() so kdb can call it without
    triggering the allocation error. This patch does that.
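
    The idea of letting the caller choose the allocation mode can be
    sketched in user space. Everything here is hypothetical: enum
    alloc_flags stands in for the kernel's allocation flags,
    reader_prepare() for the prepare call, and the in_atomic parameter for
    the calling context. Returning NULL merely models the "sleeping
    function called from invalid context" failure.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a sleeping and a non-sleeping allocation. */
enum alloc_flags { ALLOC_MAY_SLEEP, ALLOC_ATOMIC };

struct reader {
    int cpu;
};

/* Prepare a reader for one CPU. A sleeping allocation requested from
 * atomic context is rejected here, as a model of the reported BUG. */
struct reader *reader_prepare(int cpu, enum alloc_flags flags, bool in_atomic)
{
    struct reader *r;

    if (in_atomic && flags == ALLOC_MAY_SLEEP)
        return NULL;

    r = malloc(sizeof(*r));
    if (r)
        r->cpu = cpu;
    return r;
}
```

    A caller in atomic context, such as a debugger command in this model,
    passes ALLOC_ATOMIC, while ordinary callers keep ALLOC_MAY_SLEEP.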

    Note that in the original email thread about this, it was suggested
    that perhaps the solution for kdb was to either preallocate the buffer
    ahead of time or create our own iterator. I'm hoping that this
    alternative of adding allocation flags to ring_buffer_read_prepare()
    can be considered since it means I don't need to duplicate more of the
    core trace code into "trace_kdb.c" (for either creating my own
    iterator or re-preparing a ring allocator whose memory was already
    allocated).

    NOTE: another option for kdb is to actually figure out how to make it
    reuse the existing ftrace_dump() function and totally eliminate the
    duplication. This sounds very appealing and actually works (the "sr
    z" command can be seen to properly dump the ftrace buffer). The
    downside here is that ftrace_dump() fully consumes the trace buffer.
    Unless that is changed I'd rather not use it because it means "ftdump
    | grep xyz" won't be very useful to search the ftrace buffer since it
    will throw away the whole trace on the first grep. A future patch to
    dump only the last few lines of the buffer will also be hard to
    implement.

    [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com

    Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org

    Reported-by: Brian Norris
    Signed-off-by: Douglas Anderson
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Douglas Anderson
     
  • [ Upstream commit aadcef64b22f668c1a107b86d3521d9cac915c24 ]

    As Jiqun Li reported in bugzilla:

    https://bugzilla.kernel.org/show_bug.cgi?id=202883

    a deadlock sometimes occurs when the SYS_getdents64 system call runs
    while another process calls fsync().

    Monkey running on Android 9.0:

    1. task 9785 holds sbi->cp_rwsem and is waiting for lock_page()
    2. task 10349 holds mm_sem and is waiting for sbi->cp_rwsem
    3. task 9709 holds lock_page() and is waiting for mm_sem

    So this is a deadlock scenario.
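
    The three tasks above form a cycle in the wait-for graph, which is
    what makes this a deadlock. A minimal sketch of that check, assuming
    each task waits on at most one other task; has_wait_cycle() is a
    hypothetical helper, not kernel code.

```c
/* waits_for[i] is the index of the task that holds the lock task i
 * wants, or -1 if task i is not blocked. Returns 1 if following the
 * edges from some task leads back to that same task. */
int has_wait_cycle(const int waits_for[], int n)
{
    for (int start = 0; start < n; start++) {
        int cur = start;

        /* A cycle through n tasks closes within n steps. */
        for (int steps = 0; steps < n && cur >= 0; steps++) {
            cur = waits_for[cur];
            if (cur == start)
                return 1;
        }
    }
    return 0;
}
```

    Numbering the tasks 0 = 9785, 1 = 10349 and 2 = 9709, the list above
    gives waits_for = {2, 0, 1}: task 9785 waits on the page held by 9709,
    10349 waits on the rwsem held by 9785, and 9709 waits on mm_sem held
    by 10349.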

    The task stacks, as shown by the crash tool, are as follows:

    crash_arm64> bt ffffffc03c354080
    PID: 9785 TASK: ffffffc03c354080 CPU: 1 COMMAND: "RxIoScheduler-3"
    >> #7 [ffffffc01b50fac0] __lock_page at ffffff80081b11e8

    crash-arm64> bt 10349
    PID: 10349 TASK: ffffffc018b83080 CPU: 1 COMMAND: "BUGLY_ASYNC_UPL"
    >> #3 [ffffffc01f8cfa40] rwsem_down_read_failed at ffffff8008a93afc
    PC: 00000033 LR: 00000000 SP: 00000000 PSTATE: ffffffffffffffff

    crash-arm64> bt 9709
    PID: 9709 TASK: ffffffc03e7f3080 CPU: 1 COMMAND: "IntentService[A"
    >> #3 [ffffffc001e67850] rwsem_down_read_failed at ffffff8008a93afc
    >> #8 [ffffffc001e67b80] el1_ia at ffffff8008084fc4
    PC: ffffff8008274114 [compat_filldir64+120]
    LR: ffffff80083584d4 [f2fs_fill_dentries+448]
    SP: ffffffc001e67b80 PSTATE: 80400145
    X29: ffffffc001e67b80 X28: 0000000000000000 X27: 000000000000001a
    X26: 00000000000093d7 X25: ffffffc070d52480 X24: 0000000000000008
    X23: 0000000000000028 X22: 00000000d43dfd60 X21: ffffffc001e67e90
    X20: 0000000000000011 X19: ffffff80093a4000 X18: 0000000000000000
    X17: 0000000000000000 X16: 0000000000000000 X15: 0000000000000000
    X14: ffffffffffffffff X13: 0000000000000008 X12: 0101010101010101
    X11: 7f7f7f7f7f7f7f7f X10: 6a6a6a6a6a6a6a6a X9: 7f7f7f7f7f7f7f7f
    X8: 0000000080808000 X7: ffffff800827409c X6: 0000000080808000
    X5: 0000000000000008 X4: 00000000000093d7 X3: 000000000000001a
    X2: 0000000000000011 X1: ffffffc070d52480 X0: 0000000000800238
    >> #9 [ffffffc001e67be0] f2fs_fill_dentries at ffffff80083584d0
    PC: 0000003c LR: 00000000 SP: 00000000 PSTATE: 000000d9
    X12: f48a02ff X11: d4678960 X10: d43dfc00 X9: d4678ae4
    X8: 00000058 X7: d4678994 X6: d43de800 X5: 000000d9
    X4: d43dfc0c X3: d43dfc10 X2: d46799c8 X1: 00000000
    X0: 00001068

    The following potential deadlock can happen between three threads:
    Thread A                                Thread B                                    Thread C
    - f2fs_do_sync_file
     - f2fs_write_checkpoint
      - down_write(&sbi->node_change) -- 1)
                                            - do_page_fault
                                             - down_write(&mm->mmap_sem) -- 2)
                                              - do_wp_page
                                               - f2fs_vm_page_mkwrite
                                                                                        - getdents64
                                                                                         - f2fs_read_inline_dir
                                                                                          - lock_page -- 3)
      - f2fs_sync_node_pages
       - lock_page -- 3)
                                                - __do_map_lock
                                                 - down_read(&sbi->node_change) -- 1)
                                                                                          - f2fs_fill_dentries
                                                                                           - dir_emit
                                                                                            - compat_filldir64
                                                                                             - do_page_fault
                                                                                              - down_read(&mm->mmap_sem) -- 2)

    Since f2fs_readdir is protected by inode.i_rwsem, there should not be
    any updates to the inode page, so it is safe to look up dents in the
    inode page without holding its lock. Drop the lock to improve the
    concurrency of readdir and avoid the potential deadlock.

    Reported-by: Jiqun Li
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin

    Chao Yu
     
  • [ Upstream commit 2c28aba8b2e2a51749fa66e01b68e1cd5b53e022 ]

    With the test case below, we fail to find an existing xattr entry:

    1. mkfs.f2fs -O extra_attr -O flexible_inline_xattr /dev/zram0
    2. mount -t f2fs -o inline_xattr_size=1 /dev/zram0 /mnt/f2fs/
    3. touch /mnt/f2fs/file
    4. setfattr -n "user.name" -v 0 /mnt/f2fs/file
    5. getfattr -n "user.name" /mnt/f2fs/file

    /mnt/f2fs/file: user.name: No such attribute

    The reason is that for an inode with a very small inline xattr size,
    __find_inline_xattr() fails to traverse any entry because the first
    entry may not have been loaded from the xattr node yet. Later, we may
    skip checking the entire xattr data in __find_xattr(), which results
    in this wrong condition.

    This patch adds a condition to check for this case and avoid the issue.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin

    Chao Yu
     
  • [ Upstream commit fc2b47b55f17fd996f7a01975ce1c33c2f2513f6 ]

    I believe it is a bad idea to hardcode a specific compiler prefix
    that may or may not be installed on a user's system. It is annoying
    when testing features that should not require compilers at all.

    For example, mrproper, headers_install, etc. should work without
    any compiler.

    They look as follows on my machine.

    $ make ARCH=h8300 mrproper
    ./scripts/gcc-version.sh: line 26: h8300-unknown-linux-gcc: command not found
    ./scripts/gcc-version.sh: line 27: h8300-unknown-linux-gcc: command not found
    make: h8300-unknown-linux-gcc: Command not found
    make: h8300-unknown-linux-gcc: Command not found
    [ a bunch of the same error messages continue ]

    $ make ARCH=h8300 headers_install
    ./scripts/gcc-version.sh: line 26: h8300-unknown-linux-gcc: command not found
    ./scripts/gcc-version.sh: line 27: h8300-unknown-linux-gcc: command not found
    make: h8300-unknown-linux-gcc: Command not found
    HOSTCC scripts/basic/fixdep
    make: h8300-unknown-linux-gcc: Command not found
    WRAP arch/h8300/include/generated/uapi/asm/kvm_para.h
    [ snip ]

    The solution is to delete this line, or to use cc-cross-prefix like
    some architectures do. I chose the latter as a moderate fixup.

    I added an alternative 'h8300-linux-' because it is available at:

    https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

    Signed-off-by: Masahiro Yamada
    Signed-off-by: Sasha Levin

    Masahiro Yamada
     
  • [ Upstream commit bc31d0cdcfbadb6258b45db97e93b1c83822ba33 ]

    We have a customer reporting crashes in lock_get_status() with many
    "Leaked POSIX lock" messages preceding the crash.

    Leaked POSIX lock on dev=0x0:0x56 ...
    Leaked POSIX lock on dev=0x0:0x56 ...
    Leaked POSIX lock on dev=0x0:0x56 ...
    Leaked POSIX lock on dev=0x0:0x53 ...
    Leaked POSIX lock on dev=0x0:0x53 ...
    Leaked POSIX lock on dev=0x0:0x53 ...
    Leaked POSIX lock on dev=0x0:0x53 ...
    POSIX: fl_owner=ffff8900e7b79380 fl_flags=0x1 fl_type=0x1 fl_pid=20709
    Leaked POSIX lock on dev=0x0:0x4b ino...
    Leaked locks on dev=0x0:0x4b ino=0xf911400000029:
    POSIX: fl_owner=ffff89f41c870e00 fl_flags=0x1 fl_type=0x1 fl_pid=19592
    stack segment: 0000 [#1] SMP
    Modules linked in: binfmt_misc msr tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 arc4 ecb auth_rpcgss nfsv4 md4 nfs nls_utf8 lockd grace cifs sunrpc ccm dns_resolver fscache af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock xfs libcrc32c sb_edac edac_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng vmw_balloon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr vmxnet3 i2c_piix4 vmw_vmci shpchp fjes processor button ac btrfs xor raid6_pq sr_mod cdrom ata_generic sd_mod ata_piix vmwgfx crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm serio_raw ahci libahci drm libata vmw_pvscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4

    Supported: Yes
    CPU: 6 PID: 28250 Comm: lsof Not tainted 4.4.156-94.64-default #1
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
    task: ffff88a345f28740 ti: ffff88c74005c000 task.ti: ffff88c74005c000
    RIP: 0010:[] [] lock_get_status+0x9b/0x3b0
    RSP: 0018:ffff88c74005fd90 EFLAGS: 00010202
    RAX: ffff89bde83e20ae RBX: ffff89e870003d18 RCX: 0000000049534f50
    RDX: ffffffff81a3541f RSI: ffffffff81a3544e RDI: ffff89bde83e20ae
    RBP: 0026252423222120 R08: 0000000020584953 R09: 000000000000ffff
    R10: 0000000000000000 R11: ffff88c74005fc70 R12: ffff89e5ca7b1340
    R13: 00000000000050e5 R14: ffff89e870003d30 R15: ffff89e5ca7b1340
    FS: 00007fafd64be800(0000) GS:ffff89f41fd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000001c80018 CR3: 000000a522048000 CR4: 0000000000360670
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Stack:
    0000000000000208 ffffffff81a3d6b6 ffff89e870003d30 ffff89e870003d18
    ffff89e5ca7b1340 ffff89f41738d7c0 ffff89e870003d30 ffff89e5ca7b1340
    ffffffff8125e08f 0000000000000000 ffff89bc22b67d00 ffff88c74005ff28
    Call Trace:
    [] locks_show+0x2f/0x70
    [] seq_read+0x251/0x3a0
    [] proc_reg_read+0x3c/0x70
    [] __vfs_read+0x26/0x140
    [] vfs_read+0x7a/0x120
    [] SyS_read+0x42/0xa0
    [] entry_SYSCALL_64_fastpath+0x1e/0xb7

    When Linux closes a FD (close(), close-on-exec, dup2(), ...) it calls
    filp_close() which also removes all posix locks.

    The lock struct is initialized like so in filp_close() and passed
    down to cifs

    ...
    lock.fl_type = F_UNLCK;
    lock.fl_flags = FL_POSIX | FL_CLOSE;
    lock.fl_start = 0;
    lock.fl_end = OFFSET_MAX;
    ...

    Note the FL_CLOSE flag, which hints to the VFS code that this unlocking
    is done for closing the fd.

    filp_close()
      locks_remove_posix(filp, id);
        vfs_lock_file(filp, F_SETLK, &lock, NULL);
          return filp->f_op->lock(filp, cmd, fl) => cifs_lock()
            rc = cifs_setlk(file, flock, type, wait_flag, posix_lck, lock, unlock, xid);
              rc = server->ops->mand_unlock_range(cfile, flock, xid);
              if (flock->fl_flags & FL_POSIX && !rc)
                rc = locks_lock_file_wait(file, flock)

    Notice that if rc != 0 we never call locks_lock_file_wait(), which
    does the generic VFS lock/unlock/wait work on the inode.

    If we are closing the handle, the SMB server is supposed to remove any
    locks associated with it. Similarly, cifs.ko frees and wakes up any
    lock and lock waiter when closing the file:

    cifs_close()
      cifsFileInfo_put(file->private_data)
        /*
         * Delete any outstanding lock records. We'll lose them when the file
         * is closed anyway.
         */
        down_write(&cifsi->lock_sem);
        list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
            list_del(&li->llist);
            cifs_del_lock_waiters(li);
            kfree(li);
        }
        list_del(&cifs_file->llist->llist);
        kfree(cifs_file->llist);
        up_write(&cifsi->lock_sem);

    So we can safely ignore unlocking failures in cifs_lock() if they
    happen with the FL_CLOSE flag hint set as both the server and the
    client take care of it during the actual closing.
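
    The decision described here can be sketched as a small user-space
    model. The flag values, finish_unlock() and the vfs_called output are
    hypothetical; the sketch only illustrates "ignore a server unlock
    failure when FL_CLOSE is set" and is not the actual change to
    cifs_setlk().

```c
/* Hypothetical flag values for this sketch; the kernel defines its own. */
#define SK_FL_POSIX 0x1u
#define SK_FL_CLOSE 0x2u

/* server_rc is the result of the server-side unlock. On return,
 * *vfs_called is 1 if the generic VFS unlock step would run (the step
 * locks_lock_file_wait() performs in the call chain above). */
int finish_unlock(unsigned int fl_flags, int server_rc, int *vfs_called)
{
    *vfs_called = 0;

    if (!(fl_flags & SK_FL_POSIX))
        return server_rc;

    /* Ordinary unlock: a server failure is still propagated. */
    if (server_rc && !(fl_flags & SK_FL_CLOSE))
        return server_rc;

    /* Success, or a failure while closing: release the VFS lock anyway. */
    *vfs_called = 1;
    return 0;
}
```

    In this model a failed unlock during close still reaches the VFS step,
    so no POSIX lock record is left behind on the inode, while a failed
    unlock outside of close keeps returning the error.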

    This is not a proper fix for the unlocking failure but it's safe and
    it seems to prevent the lock leakages and crashes the customer
    experiences.

    Signed-off-by: Aurelien Aptel
    Signed-off-by: NeilBrown
    Signed-off-by: Steve French
    Acked-by: Pavel Shilovsky
    Signed-off-by: Sasha Levin

    Aurelien Aptel