22 Jun, 2020

40 commits

  • commit 8b8c704d913b0fe490af370631a4200e26334ec0 upstream.

    Commit 6cc7c266e5b4 ("ima: Call ima_calc_boot_aggregate() in
    ima_eventdigest_init()") added a call to ima_calc_boot_aggregate() so that
    the digest can be recalculated for the boot_aggregate measurement entry if
    the 'd' template field has been requested. For the 'd' field, only SHA1 and
    MD5 digests are accepted.

    Given that ima_eventdigest_init() does not have the __init annotation,
    none of the functions it calls should have it either, since __init code
    is discarded after boot while ima_eventdigest_init() can run at any time.
    This patch removes __init from ima_pcrread().

    Cc: stable@vger.kernel.org
    Fixes: 6cc7c266e5b4 ("ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init()")
    Reported-by: Linus Torvalds
    Signed-off-by: Roberto Sassu
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roberto Sassu
     
  • commit 6cc7c266e5b47d3cd2b5bb7fd3aac4e6bb2dd1d2 upstream.

    If the template field 'd' is chosen and the digest to be added to the
    measurement entry was not calculated with SHA1 or MD5, it is
    recalculated with SHA1, by using the passed file descriptor. However, this
    cannot be done for boot_aggregate, because there is no file descriptor.

    This patch adds a call to ima_calc_boot_aggregate() in
    ima_eventdigest_init(), so that the digest can be recalculated also for the
    boot_aggregate entry.

    Cc: stable@vger.kernel.org # 3.13.x
    Fixes: 3ce1217d6cd5d ("ima: define template fields library and new helpers")
    Reported-by: Takashi Iwai
    Signed-off-by: Roberto Sassu
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Roberto Sassu
     
  • commit 067a436b1b0aafa593344fddd711a755a58afb3b upstream.

    This patch prevents the following oops:

    [ 10.771813] BUG: kernel NULL pointer dereference, address: 0000000000000
    [...]
    [ 10.779790] RIP: 0010:ima_match_policy+0xf7/0xb80
    [...]
    [ 10.798576] Call Trace:
    [ 10.798993] ? ima_lsm_policy_change+0x2b0/0x2b0
    [ 10.799753] ? inode_init_owner+0x1a0/0x1a0
    [ 10.800484] ? _raw_spin_lock+0x7a/0xd0
    [ 10.801592] ima_must_appraise.part.0+0xb6/0xf0
    [ 10.802313] ? ima_fix_xattr.isra.0+0xd0/0xd0
    [ 10.803167] ima_must_appraise+0x4f/0x70
    [ 10.804004] ima_post_path_mknod+0x2e/0x80
    [ 10.804800] do_mknodat+0x396/0x3c0

    It occurs when there is a failure during IMA initialization, and
    ima_init_policy() is not called. IMA hooks still call ima_match_policy()
    but ima_rules is NULL. This patch prevents the crash by assigning the
    ima_default_policy pointer to ima_rules directly at the point where
    ima_rules is defined. This does not alter the existing behavior, as
    ima_rules is always set at the end of ima_init_policy().
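    The shape of the fix can be modeled in a few lines of userspace C (a
    minimal sketch; the struct, values, and _sketch names are illustrative,
    not the kernel's actual symbols):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the fix: the rules pointer gets a safe static
 * default at its definition, so a hook that runs after a failed IMA
 * init no longer dereferences NULL. */
struct ima_rule_list { int nrules; };

static struct ima_rule_list ima_default_policy_sketch = { .nrules = 3 };

/* before the fix: `static struct ima_rule_list *ima_rules;` (NULL until
 * ima_init_policy() ran to completion) */
static struct ima_rule_list *ima_rules_sketch = &ima_default_policy_sketch;

int ima_match_policy_sketch(void)
{
    /* safe even if policy initialization never happened */
    return ima_rules_sketch->nrules;
}
```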

    Cc: stable@vger.kernel.org # 3.7.x
    Fixes: 07f6a79415d7d ("ima: add appraise action keywords and default rules")
    Reported-by: Takashi Iwai
    Signed-off-by: Roberto Sassu
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Roberto Sassu
     
  • commit e144d6b265415ddbdc54b3f17f4f95133effa5a8 upstream.

    Evaluate the error in init_ima() before calling
    register_blocking_lsm_notifier(), and return if it is not zero.

    Cc: stable@vger.kernel.org # 5.3.x
    Fixes: b16942455193 ("ima: use the lsm policy update notifier")
    Signed-off-by: Roberto Sassu
    Reviewed-by: James Morris
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Roberto Sassu
     
  • commit 6f1a1d103b48b1533a9c804e7a069e2c8e937ce7 upstream.

    boot_aggregate is the first entry of IMA measurement list. Its purpose is
    to link pre-boot measurements to IMA measurements. As IMA was designed to
    work with a TPM 1.2, the SHA1 PCR bank was always selected even if a
    TPM 2.0 with support for stronger hash algorithms is available.

    This patch first tries to find a PCR bank with the IMA default hash
    algorithm. If it does not find it, it selects the SHA256 PCR bank for
    TPM 2.0 and SHA1 for TPM 1.2. Ultimately, it selects SHA1 also for TPM 2.0
    if the SHA256 PCR bank is not found.

    If none of the PCR banks above can be found, boot_aggregate file digest is
    filled with zeros, as for TPM bypass, making it impossible to perform a
    remote attestation of the system.
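    The fallback order described above can be sketched as a small userspace
    function (a hedged model only; the helper names and string-based bank
    representation are illustrative, not the kernel's TPM API):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

static int bank_available(const char **banks, int n, const char *alg)
{
    for (int i = 0; i < n; i++)
        if (strcmp(banks[i], alg) == 0)
            return 1;
    return 0;
}

/* Returns the algorithm whose PCR bank boot_aggregate should use, or
 * NULL when no suitable bank exists (digest then filled with zeros). */
const char *select_boot_aggregate_bank(const char **banks, int n,
                                       const char *ima_default, int is_tpm2)
{
    if (bank_available(banks, n, ima_default))
        return ima_default;               /* 1st: IMA default hash */
    if (is_tpm2 && bank_available(banks, n, "sha256"))
        return "sha256";                  /* 2nd: SHA256 on TPM 2.0 */
    if (bank_available(banks, n, "sha1"))
        return "sha1";                    /* 3rd: SHA1, even on TPM 2.0 */
    return NULL;                          /* TPM-bypass-like fallback */
}
```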

    Cc: stable@vger.kernel.org # 5.1.x
    Fixes: 879b589210a9 ("tpm: retrieve digest size of unknown algorithms with PCR read")
    Reported-by: Jerry Snitselaar
    Suggested-by: James Bottomley
    Signed-off-by: Roberto Sassu
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Roberto Sassu
     
  • commit 1129d31b55d509f15e72dc68e4b5c3a4d7b4da8d upstream.

    Function hash_long() accepts unsigned long, while currently only one byte
    is passed from ima_hash_key(), which calculates a key for ima_htable.

    Given that hashing the digest does not give clear benefits compared to
    using the digest itself, remove hash_long() and return the modulus
    calculated on the first two bytes of the digest with the number of slots.
    Also reduce the depth of the hash table by doubling the number of slots.
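    The new key derivation can be sketched in userspace as follows (the
    table-size constant is illustrative of the doubled slot count; the
    kernel defines its own):

```c
#include <assert.h>

/* before: hash_long(*digest, IMA_HASH_BITS) -- only one byte fed in */
#define IMA_MEASURE_HTABLE_SIZE 1024  /* illustrative doubled slot count */

unsigned long ima_hash_key(const unsigned char *digest)
{
    /* The digest bytes are already uniformly distributed, so taking two
     * raw bytes modulo the slot count is as good as hashing them again. */
    return (digest[0] | digest[1] << 8) % IMA_MEASURE_HTABLE_SIZE;
}
```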

    Cc: stable@vger.kernel.org
    Fixes: 3323eec921ef ("integrity: IMA as an integrity service provider")
    Co-developed-by: Roberto Sassu
    Signed-off-by: Roberto Sassu
    Signed-off-by: Krzysztof Struczynski
    Acked-by: David.Laight@aculab.com (big endian system concerns)
    Signed-off-by: Mimi Zohar
    Signed-off-by: Greg Kroah-Hartman

    Krzysztof Struczynski
     
  • commit da97f2d56bbd880b4138916a7ef96f9881a551b2 upstream.

    Now that deferred pages are initialized with interrupts enabled we can
    replace touch_nmi_watchdog() with cond_resched(), as it was before
    3a2d7fa8a3d5.

    For now, we cannot do the same in deferred_grow_zone(), as it still
    initializes pages with interrupts disabled.

    This change fixes RCU problem described in
    https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com

    [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
    [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
    [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
    [ 1.760091] NMI backtrace for cpu 1
    [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
    [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
    [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
    [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
    [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
    [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
    [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
    [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
    [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
    [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
    [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
    [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
    [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 1.760091] Call Trace:
    [ 1.760091] deferred_init_pages+0x8f/0xbf
    [ 1.760091] deferred_init_memmap+0x184/0x29d
    [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
    [ 1.760091] kthread+0x112/0x130
    [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
    [ 1.760091] ret_from_fork+0x35/0x40
    [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Reported-by: Yiqian Wei
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Tested-by: David Hildenbrand
    Reviewed-by: Daniel Jordan
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: James Morris
    Cc: Kirill Tkhai
    Cc: Sasha Levin
    Cc: Shile Zhang
    Cc: Vlastimil Babka
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit 117003c32771df617acf66e140fbdbdeb0ac71f5 upstream.

    Patch series "initialize deferred pages with interrupts enabled", v4.

    Keep interrupts enabled during deferred page initialization in order to
    make code more modular and allow jiffies to update.

    Original approach, and discussion can be found here:
    http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com

    This patch (of 3):

    deferred_init_memmap() disables interrupts the entire time, so it calls
    touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
    will run with interrupts enabled, at which point cond_resched() should be
    used instead.

    deferred_grow_zone() makes the same watchdog calls through code shared
    with deferred init but will continue to run with interrupts disabled, so
    it can't call cond_resched().

    Pull the watchdog calls up to these two places to allow the first to be
    changed later, independently of the second. The frequency reduces from
    twice per pageblock (init and free) to once per max order block.
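    The restructuring described above can be modeled as a small userspace
    sketch (all names and the block size are illustrative, not the kernel's
    code): the shared per-block helper no longer pokes the watchdog itself,
    each caller does it once per max order block, so deferred_init_memmap()
    can later switch its call to cond_resched() independently:

```c
#include <assert.h>

static int watchdog_touches;

static void touch_watchdog(void) { watchdog_touches++; }

/* stand-in for the shared init code, now free of watchdog calls */
static unsigned long init_one_maxorder_block(unsigned long pfn)
{
    return pfn + 1024;
}

unsigned long deferred_init_memmap_sketch(unsigned long start,
                                          unsigned long end)
{
    unsigned long pfn = start, done = 0;
    while (pfn < end) {
        pfn = init_one_maxorder_block(pfn);
        done += 1024;
        touch_watchdog();   /* once per max order block, in the caller */
    }
    return done;
}
```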

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Signed-off-by: Daniel Jordan
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Shile Zhang
    Cc: Kirill Tkhai
    Cc: James Morris
    Cc: Sasha Levin
    Cc: Yiqian Wei
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Jordan
     
  • commit 3d060856adfc59afb9d029c233141334cfaba418 upstream.

    Initializing struct pages is a long task and keeping interrupts disabled
    for the duration of this operation introduces a number of problems.

    1. jiffies are not updated for a long period of time, and thus incorrect
    time is reported. See proposed solution and discussion here:
    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
    2. It prevents further improvements to deferred page initialization, such
    as allowing intra-node multi-threading.

    We are keeping interrupts disabled to solve a rather theoretical problem
    that was never observed in the real world (see 3a2d7fa8a3d5).

    Let's keep interrupts enabled. In case we ever encounter a scenario where
    an interrupt thread wants to allocate a large amount of memory this early
    in boot, we can deal with that by growing the zone (see
    deferred_grow_zone()) by the needed amount before starting the
    deferred_init_memmap() threads.

    Before:
    [ 1.232459] node 0 initialised, 12058412 pages in 1ms

    After:
    [ 1.632580] node 0 initialised, 12051227 pages in 436ms

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Reported-by: Shile Zhang
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: James Morris
    Cc: Kirill Tkhai
    Cc: Sasha Levin
    Cc: Yiqian Wei
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit c444eb564fb16645c172d550359cb3d75fe8a040 upstream.

    Write-protect anon page faults require an accurate mapcount to decide
    whether to break the COW. This is implemented in the THP path with
    reuse_swap_page() ->
    page_trans_huge_map_swapcount()/page_trans_huge_mapcount().

    If the COW triggers while the other processes sharing the page are
    under a huge pmd split, to do an accurate reading, we must ensure the
    mapcount isn't computed while it's being transferred from the head
    page to the tail pages.

    reuse_swap_page() already runs serialized by the page lock, so it is
    enough to also take the page lock around __split_huge_pmd_locked() in
    order to add the missing serialization.
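    The race and the serialization can be illustrated with a toy userspace
    model (purely a sketch; the counts, the lock flag, and every name are
    invented here, not kernel code):

```c
#include <assert.h>

/* The huge page's mapcount starts on the head page and moves to the
 * tails during a pmd split. Holding the page lock across both the
 * transfer and the count read guarantees the reader never observes a
 * half-done transfer. */
static int page_locked;
static int head_mapcount = 4, tail_mapcount[4];

static void lock_page(void)   { assert(!page_locked); page_locked = 1; }
static void unlock_page(void) { page_locked = 0; }

void split_huge_pmd_sketch(void)
{
    lock_page();                      /* the missing page lock */
    for (int i = 0; i < 4; i++) {
        tail_mapcount[i]++;           /* transfer head -> tail */
        head_mapcount--;              /* total is transient in between */
    }
    unlock_page();
}

int total_mapcount_sketch(void)
{
    int sum = 0;
    lock_page();                      /* same lock as the split path */
    sum += head_mapcount;
    for (int i = 0; i < 4; i++)
        sum += tail_mapcount[i];
    unlock_page();
    return sum;
}
```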

    Note: the commit in "Fixes" is listed just to facilitate the backporting,
    because the code before that commit didn't try to do an accurate THP
    mapcount calculation and instead used page_count() to decide whether to
    COW. Both the page_count and the pin_count are THP-wide refcounts, so
    they're inaccurate if used in reuse_swap_page(). Reverting that commit
    (besides the unrelated fix to the local anon_vma assignment) would have
    also reopened the window for the memory corruption side effects to
    certain workloads documented in its commit header.

    Signed-off-by: Andrea Arcangeli
    Suggested-by: Jann Horn
    Reported-by: Jann Horn
    Acked-by: Kirill A. Shutemov
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 4e3319c23a66dabfd6c35f4d2633d64d99b68096 upstream.

    Setting init mem to NX shall depend on sinittext being mapped by
    block, not on stext being mapped by block.

    Setting text and rodata to RO shall depend on stext being mapped by
    block, not on sinittext being mapped by block.

    Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/7d565fb8f51b18a3d98445a830b2f6548cb2da2a.1589866984.git.christophe.leroy@csgroup.eu
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 2166e5edce9ac1edf3b113d6091ef72fcac2d6c4 upstream.

    We always preallocate a data extent for writing a free space cache, which
    causes writeback to always try the nocow path first, since the free space
    inode has the prealloc bit set in its flags.

    However if the block group that contains the data extent for the space
    cache has been turned to RO mode, due to a running scrub or balance for
    example, we have to fall back to the cow path. In that case once a new
    data extent is allocated we end up calling btrfs_add_reserved_bytes(),
    which decrements the counter named bytes_may_use from the data space_info
    object with the expectation that this counter was previously incremented
    by the same amount (the size of the data extent).

    However when we started writeout of the space cache at cache_save_setup(),
    we incremented the value of the bytes_may_use counter through a call to
    btrfs_check_data_free_space() and then decremented it through a call to
    btrfs_prealloc_file_range_trans() immediately after. So when starting the
    writeback, if we fall back to cow mode, we have to increment the counter
    bytes_may_use of the data space_info again to compensate for the extent
    allocation done by the cow path.

    When this issue happens we are incorrectly decrementing the bytes_may_use
    counter, and when its current value is smaller than the amount we try to
    subtract we end up with the following warning:

    ------------[ cut here ]------------
    WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
    CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    Workqueue: writeback wb_workfn (flush-btrfs-1591)
    RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Code: ff ff 48 (...)
    RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
    RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
    RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
    R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
    R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
    FS: 0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    find_free_extent+0x4a0/0x16c0 [btrfs]
    btrfs_reserve_extent+0x91/0x180 [btrfs]
    cow_file_range+0x12d/0x490 [btrfs]
    btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
    ? find_lock_delalloc_range+0x221/0x250 [btrfs]
    writepage_delalloc+0xe8/0x150 [btrfs]
    __extent_writepage+0xe8/0x4c0 [btrfs]
    extent_write_cache_pages+0x237/0x530 [btrfs]
    extent_writepages+0x44/0xa0 [btrfs]
    do_writepages+0x23/0x80
    __writeback_single_inode+0x59/0x700
    writeback_sb_inodes+0x267/0x5f0
    __writeback_inodes_wb+0x87/0xe0
    wb_writeback+0x382/0x590
    ? wb_workfn+0x4a2/0x6c0
    wb_workfn+0x4a2/0x6c0
    process_one_work+0x26d/0x6a0
    worker_thread+0x4f/0x3e0
    ? process_one_work+0x6a0/0x6a0
    kthread+0x103/0x140
    ? kthread_create_worker_on_cpu+0x70/0x70
    ret_from_fork+0x3a/0x50
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    softirqs last disabled at (0): [] 0x0
    ---[ end trace bd7c03622e0b0a52 ]---
    ------------[ cut here ]------------

    So fix this by incrementing the bytes_may_use counter of the data
    space_info when we fallback to the cow path. If the cow path is successful
    the counter is decremented after extent allocation (by
    btrfs_add_reserved_bytes()), if it fails it ends up being decremented as
    well when clearing the delalloc range (extent_clear_unlock_delalloc()).

    This could be triggered sporadically by the test case btrfs/061 from
    fstests.

    Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 467dc47ea99c56e966e99d09dae54869850abeeb upstream.

    When doing a buffered write we always try to reserve data space for it,
    even when the file has the NOCOW bit set or the write falls into a file
    range covered by a prealloc extent. This is done both because it is
    expensive to check if we can do a nocow write (checking if an extent is
    shared through reflinks or if there's a hole in the range for example),
    and because when writeback starts we might actually need to fall back to
    COW mode (for example the block group containing the target extents was
    turned into RO mode due to a scrub or balance).

    When we are unable to reserve data space we check if we can do a nocow
    write, and if we can, we proceed with dirtying the pages and setting up
    the range for delalloc. In this case the bytes_may_use counter of the
    data space_info object is not incremented, unlike in the case where we
    are able to reserve data space (done through btrfs_check_data_free_space()
    which calls btrfs_alloc_data_chunk_ondemand()).

    Later when running delalloc we attempt to start writeback in nocow mode
    but we might revert back to cow mode, for example because in the meanwhile
    a block group was turned into RO mode by a scrub or relocation. The cow
    path after successfully allocating an extent ends up calling
    btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
    the data space_info object to have been incremented before - but we did
    not do it when the buffered write started, since there was not enough
    available data space. So btrfs_add_reserved_bytes() ends up decrementing
    the bytes_may_use counter anyway, and when the counter's current value
    is smaller than the size of the allocated extent we get a stack trace
    like the following:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
    CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    Workqueue: writeback wb_workfn (flush-btrfs-1754)
    RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Code: ff ff 48 (...)
    RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
    RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
    RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
    R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
    R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
    FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    find_free_extent+0x4a0/0x16c0 [btrfs]
    btrfs_reserve_extent+0x91/0x180 [btrfs]
    cow_file_range+0x12d/0x490 [btrfs]
    run_delalloc_nocow+0x341/0xa40 [btrfs]
    btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
    ? find_lock_delalloc_range+0x221/0x250 [btrfs]
    writepage_delalloc+0xe8/0x150 [btrfs]
    __extent_writepage+0xe8/0x4c0 [btrfs]
    extent_write_cache_pages+0x237/0x530 [btrfs]
    ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
    extent_writepages+0x44/0xa0 [btrfs]
    do_writepages+0x23/0x80
    __writeback_single_inode+0x59/0x700
    writeback_sb_inodes+0x267/0x5f0
    __writeback_inodes_wb+0x87/0xe0
    wb_writeback+0x382/0x590
    ? wb_workfn+0x4a2/0x6c0
    wb_workfn+0x4a2/0x6c0
    process_one_work+0x26d/0x6a0
    worker_thread+0x4f/0x3e0
    ? process_one_work+0x6a0/0x6a0
    kthread+0x103/0x140
    ? kthread_create_worker_on_cpu+0x70/0x70
    ret_from_fork+0x3a/0x50
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    softirqs last disabled at (0): [] 0x0
    ---[ end trace f9f6ef8ec4cd8ec9 ]---

    So to fix this, when falling back into cow mode check if space was not
    reserved, by testing for the bit EXTENT_NORESERVE in the respective file
    range, and if not, increment the bytes_may_use counter for the data
    space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
    that if the cow path fails it decrements the bytes_may_use counter when
    clearing the delalloc range (through the btrfs_clear_delalloc_extent()
    callback).
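    The accounting logic of the fix can be modeled in userspace (a hedged
    sketch mirroring the names in the message above; the flag value and
    helpers are illustrative, not the kernel's):

```c
#include <assert.h>

#define EXTENT_NORESERVE 0x1  /* illustrative bit value */

static long bytes_may_use;

/* When the cow fallback finds EXTENT_NORESERVE set (no data space was
 * reserved at write time), it adds the range's size to bytes_may_use and
 * clears the bit, so the later decrement cannot underflow. */
void cow_fallback_fixup(unsigned *range_flags, long range_len)
{
    if (*range_flags & EXTENT_NORESERVE) {
        bytes_may_use += range_len;      /* compensate the missing reserve */
        *range_flags &= ~EXTENT_NORESERVE;
    }
}

/* what btrfs_add_reserved_bytes() then does after the cow allocation */
void add_reserved_bytes(long len)
{
    assert(bytes_may_use >= len);        /* the warning fired when this failed */
    bytes_may_use -= len;
}
```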

    Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit e2c8e92d1140754073ad3799eb6620c76bab2078 upstream.

    If an error happens while running delalloc in COW mode for a range, we can
    end up calling extent_clear_unlock_delalloc() for a range that goes beyond
    our range's end offset by 1 byte, which affects 1 extra page. This results
    in clearing bits and doing page operations (such as a page unlock) outside
    our target range.

    Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
    offset, instead of an exclusive end offset, at cow_file_range().
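    The off-by-one is easy to see in a tiny sketch (illustrative helper, not
    the btrfs code): for a range starting at `start` with length `len`, the
    exclusive end is start + len, but the last byte (and page) the range
    actually covers is start + len - 1.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Index of the last page a range touches, computed from the inclusive
 * end offset. Passing the exclusive end to a helper that expects an
 * inclusive one lands on one extra page past the range. */
unsigned long last_page_index(unsigned long start, unsigned long len)
{
    unsigned long end_inclusive = start + len - 1;
    return end_inclusive / PAGE_SIZE;
}
```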

    Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 6d3113a193e3385c72240096fe397618ecab6e43 upstream.

    In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
    stripe or chunk, we submit orig_bio without cloning it. In this case, we
    don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
    decrement pending_bios to -1, and we never complete orig_bio. Fix it by
    initializing pending_bios to 1 instead of incrementing later.

    Fixing this exposes another bug: we put orig_bio prematurely and then
    put it again from end_io. Fix it by not putting orig_bio.

    After this change, pending_bios is really more of a reference count, but
    I'll leave that cleanup separate to keep the fix small.
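    The refcount-style pattern of the fix can be sketched in userspace (names
    and structure are illustrative, not the btrfs implementation): start
    pending at 1 for orig_bio itself instead of incrementing later, so an
    early failure drops it to 0 and completes exactly once, rather than
    underflowing to -1.

```c
#include <assert.h>

static int pending;
static int completed;

void dio_start(void)  { pending = 1; completed = 0; } /* 1 = orig_bio itself */
void dio_clone(void)  { pending++; }                  /* each cloned bio */
void dio_end_one(void)
{
    if (--pending == 0)
        completed = 1;    /* complete orig_bio exactly once, never at -1 */
}
```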

    Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • commit 9c343784c4328781129bcf9e671645f69fe4b38a upstream.

    Nikolay noticed a bunch of test failures with my global rsv steal
    patches. At first he thought they were introduced by them, but they've
    been failing for a while with 64k nodes.

    The problem is with 64k nodes we have a global reserve that calculates
    out to 13MiB on a freshly made file system, which only has 8MiB of
    metadata space. Because of changes I previously made we no longer
    account for the global reserve in the overcommit logic, which means we
    correctly allow overcommit to happen even though we are already
    overcommitted.

    However in some corner cases, for example btrfs/170, we will allocate
    the entire file system up with data chunks before we have enough space
    pressure to allocate a metadata chunk. Then once the fs is full we
    ENOSPC out because we cannot overcommit and the global reserve is taking
    up all of the available space.

    The ideal way to deal with this is to change our space reservation code
    to take into account the height of the trees that we're modifying, so
    that our global reserve calculation does not end up so obscenely large.

    However that is a huge undertaking. Instead fix this by forcing a chunk
    allocation if the global reserve is larger than the total metadata
    space. This gives us essentially the same behavior that happened
    before, we get a chunk allocated and these tests can pass.

    This is meant to be a stop-gap measure until we can tackle the "tree
    height only" project.

    Fixes: 0096420adb03 ("btrfs: do not account global reserve in can_overcommit")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 89efda52e6b6930f80f5adda9c3c9edfb1397191 upstream.

    Whenever a chown is executed, all capabilities of the file being touched
    are lost. When doing incremental send with a file with capabilities,
    there is a situation where the capability can be lost on the receiving
    side. The sequence of actions below shows the problem:

    $ mount /dev/sda fs1
    $ mount /dev/sdb fs2

    $ touch fs1/foo.bar
    $ setcap cap_sys_nice+ep fs1/foo.bar
    $ btrfs subvolume snapshot -r fs1 fs1/snap_init
    $ btrfs send fs1/snap_init | btrfs receive fs2

    $ chgrp adm fs1/foo.bar
    $ setcap cap_sys_nice+ep fs1/foo.bar

    $ btrfs subvolume snapshot -r fs1 fs1/snap_complete
    $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental

    $ btrfs send fs1/snap_complete | btrfs receive fs2
    $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2

    At this point, only a chown was emitted by "btrfs send", since only the
    group was changed. This causes the cap_sys_nice capability to be dropped
    from fs2/snap_incremental/foo.bar.

    To fix that, only emit capabilities after the chown is emitted. The
    current code first checks for xattrs that are new/changed, emits them, and
    only later emits the chown. Now, __process_new_xattr() skips capabilities,
    leaving finish_inode_if_needed() to emit them, if they exist, for the
    inode being processed.

    This behavior was being worked around on the "btrfs receive" side by
    caching the capability and only applying it after the chown. Now that
    xattrs are emitted only _after_ the chown, that workaround is no longer
    needed.

    Link: https://github.com/kdave/btrfs-progs/issues/202
    CC: stable@vger.kernel.org # 4.4+
    Suggested-by: Filipe Manana
    Reviewed-by: Filipe Manana
    Signed-off-by: Marcos Paulo de Souza
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Marcos Paulo de Souza
     
  • commit 998a0671961f66e9fad4990ed75f80ba3088c2f1 upstream.

    btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
    the bdev with the greatest device::generation number. For a typical
    missing device the generation number is zero, so fs_devices::latest_bdev
    will never point to it.

    But if the missing device is due to alienation [1], then
    device::generation is not zero and if it is greater or equal to the rest
    of device generations in the list, then fs_devices::latest_bdev ends up
    pointing to the missing device and reports the error like [2].

    [1] We maintain the devices of a fsid (as in fs_devices::fsid) in the
    fs_devices::devices list; a device is considered an alien device if its
    fsid does not match fs_devices::fsid.

    Consider a working filesystem with raid1:

    $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
    $ mount /dev/sda /mnt-raid1
    $ umount /mnt-raid1

    While mnt-raid1 was unmounted the user force-adds one of its devices to
    another btrfs filesystem:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt-single
    $ btrfs dev add -f /dev/sda /mnt-single

    Now the original mnt-raid1 fails to mount in degraded mode, because
    fs_devices::latest_bdev is pointing to the alien device.

    $ mount -o degraded /dev/sdb /mnt-raid1

    [2]
    mount: wrong fs type, bad option, bad superblock on /dev/sdb,
    missing codepage or helper program, or other error

    In some cases useful info is found in syslog - try
    dmesg | tail or so.

    kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
    kernel: BTRFS error (device sdb): failed to read devices
    kernel: BTRFS error (device sdb): open_ctree failed

    Fix the root cause by checking if the device is not missing before it
    can be considered for the fs_devices::latest_bdev.
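    The selection logic of the fix can be sketched as follows (a hedged
    userspace model; the struct layout and names are illustrative, not the
    kernel's btrfs_device handling):

```c
#include <assert.h>
#include <stddef.h>

struct dev { unsigned long long generation; int missing; };

/* Pick the device with the greatest generation for latest_bdev, but
 * skip devices flagged missing (such as an alienated one), so a device
 * without a usable bdev can never win. */
const struct dev *pick_latest(const struct dev *devs, int n)
{
    const struct dev *best = NULL;
    for (int i = 0; i < n; i++) {
        if (devs[i].missing)
            continue;                 /* the added check */
        if (!best || devs[i].generation >= best->generation)
            best = &devs[i];
    }
    return best;
}
```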

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • commit 7f551d969037cc128eca60688d9c5a300d84e665 upstream.

    When an old device gets a new fsid through 'btrfs device add -f ', our
    fs_devices list has an alien device in one of the fs_devices lists.

    By having an alien device in fs_devices, we have two issues so far

    1. a missing device does not show as missing in userland

    2. degraded mount will fail

    Both issues are caused by the fact that there's an alien device in the
    fs_devices list. (Alien means that it does not belong to the filesystem,
    identified by fsid, or does not contain a btrfs filesystem at all, e.g.
    due to an overwrite).

    A device can be scanned/added through the control device ioctls
    SCAN_DEV, DEVICES_READY or by ADD_DEV.

    A device coming through the control device is checked against all the
    other devices in the lists, but this was not the case for ADD_DEV.

    This patch fixes both issues above by removing the alien device.
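
    The core of the check can be sketched as follows (a minimal model;
    FSID_SIZE and the function name are illustrative, the kernel compares
    the fsid stored in each device's superblock):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FSID_SIZE 16	/* illustrative; btrfs identifies filesystems by UUID */

/* A device whose on-disk fsid does not match the filesystem's fsid is
 * "alien" and must be removed from the fs_devices list. */
static bool device_is_alien(const unsigned char *dev_fsid,
			    const unsigned char *fs_fsid)
{
	return memcmp(dev_fsid, fs_fsid, FSID_SIZE) != 0;
}
```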

    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • [ Upstream commit 47227d27e2fcb01a9e8f5958d8997cf47a820afc ]

    The memcmp KASAN self-test fails on a kernel with both KASAN and
    FORTIFY_SOURCE.

    When FORTIFY_SOURCE is on, a number of functions are replaced with
    fortified versions, which attempt to check the sizes of the operands.
    However, these functions often directly invoke __builtin_foo() once they
    have performed the fortify check. Using __builtins may bypass KASAN
    checks if the compiler decides to inline its own implementation as a
    sequence of instructions, rather than emit a function call that goes out
    to a KASAN-instrumented implementation.

    Why is only memcmp affected?
    ============================

    Of the string and string-like functions that kasan_test tests, only memcmp
    is replaced by an inline sequence of instructions in my testing on x86
    with gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2).

    I believe this is due to compiler heuristics. For example, if I annotate
    kmalloc calls with the alloc_size annotation (and disable some fortify
    compile-time checking!), the compiler will replace every memset except the
    one in kmalloc_uaf_memset with inline instructions. (I have some WIP
    patches to add this annotation.)

    Does this affect other functions in string.h?
    =============================================

    Yes. Anything that uses __builtin_* rather than __real_* could be
    affected. This looks like:

    - strncpy
    - strcat
    - strlen
    - strlcpy maybe, under some circumstances?
    - strncat under some circumstances
    - memset
    - memcpy
    - memmove
    - memcmp (as noted)
    - memchr
    - strcpy

    Whether a function call is emitted always depends on the compiler. Most
    bugs should get caught by FORTIFY_SOURCE, but the missed memcmp test shows
    that this is not always the case.

    Isn't FORTIFY_SOURCE disabled with KASAN?
    =========================================

    The string headers on all arches supporting KASAN disable fortify with
    kasan, but only when address sanitisation is _also_ disabled. For example
    from x86:

    #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
    /*
    * For files that are not instrumented (e.g. mm/slub.c) we
    * should use not instrumented version of mem* functions.
    */
    #define memcpy(dst, src, len) __memcpy(dst, src, len)
    #define memmove(dst, src, len) __memmove(dst, src, len)
    #define memset(s, c, n) __memset(s, c, n)

    #ifndef __NO_FORTIFY
    #define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */
    #endif

    #endif

    This comes from commit 6974f0c4555e ("include/linux/string.h: add the
    option of fortified string.h functions"), and doesn't work when KASAN is
    enabled and the file is supposed to be sanitised - as with test_kasan.c

    I'm pretty sure this is not wrong, but not as expansive as it should be:

    * we shouldn't use __builtin_memcpy etc in files where we don't have
    instrumentation - it could devolve into a function call to memcpy,
    which will be instrumented. Rather, we should use __memcpy which
    by convention is not instrumented.

    * we also shouldn't be using __builtin_memcpy when we have a KASAN
    instrumented file, because it could be replaced with inline asm
    that will not be instrumented.

    What is correct behaviour?
    ==========================

    Firstly, there is some overlap between fortification and KASAN: both
    provide some level of _runtime_ checking. Only fortify provides
    compile-time checking.

    KASAN and fortify can pick up different things at runtime:

    - Some fortify functions, notably the string functions, could easily be
    modified to consider sub-object sizes (e.g. members within a struct),
    and I have some WIP patches to do this. KASAN cannot detect these
    because it cannot insert poison between members of a struct.

    - KASAN can detect many over-reads/over-writes when the sizes of both
    operands are unknown, which fortify cannot.

    So there are a couple of options:

    1) Flip the test: disable fortify in sanitised files and enable it in
    unsanitised files. This at least stops us missing KASAN checking, but
    we lose the fortify checking.

    2) Make the fortify code always call out to real versions. Do this only
    for KASAN, for fear of losing the inlining opportunities we get from
    __builtin_*.

    (We can't use kasan_check_{read,write}: because the fortify functions are
    _extern inline_, you can't include _static_ inline functions without a
    compiler warning. kasan_check_{read,write} are static inline so we can't
    use them even when they would otherwise be suitable.)

    Take approach 2 and call out to real versions when KASAN is enabled.

    Use __underlying_foo to distinguish from __real_foo: __real_foo always
    refers to the kernel's implementation of foo, __underlying_foo could be
    either the kernel implementation or the __builtin_foo implementation.
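
    The resulting pattern can be sketched like this (a compressed model of
    the approach; the kernel's real definitions live in
    include/linux/string.h, CONFIG_KASAN here stands in for the actual
    preprocessor conditions, and __builtin_memcpy requires GCC/Clang):

```c
#include <assert.h>
#include <string.h>

/* Under KASAN, route the fortified wrappers to the kernel's real,
 * instrumented functions; otherwise keep the __builtin_* versions for
 * their inlining opportunities. */
#ifdef CONFIG_KASAN
extern void *__real_memcpy(void *dst, const void *src, size_t n);
#define __underlying_memcpy __real_memcpy
#else
#define __underlying_memcpy __builtin_memcpy
#endif

static void *fortified_memcpy(void *dst, const void *src, size_t n)
{
	/* ... fortify compile-time/runtime size checks would go here ... */
	return __underlying_memcpy(dst, src, n);
}
```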

    This is sometimes enough to make the memcmp test succeed with
    FORTIFY_SOURCE enabled. It is at least enough to get the function call
    into the module. One more fix is needed to make it reliable: see the next
    patch.

    Fixes: 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions")
    Signed-off-by: Daniel Axtens
    Signed-off-by: Andrew Morton
    Tested-by: David Gow
    Reviewed-by: Dmitry Vyukov
    Cc: Daniel Micay
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Link: http://lkml.kernel.org/r/20200423154503.5103-3-dja@axtens.net
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Daniel Axtens
     
  • [ Upstream commit adb72ae1915db28f934e9e02c18bfcea2f3ed3b7 ]

    Patch series "Fix some incompatibilities between KASAN and FORTIFY_SOURCE", v4.

    3 KASAN self-tests fail on a kernel with both KASAN and FORTIFY_SOURCE:
    memchr, memcmp and strlen.

    When FORTIFY_SOURCE is on, a number of functions are replaced with
    fortified versions, which attempt to check the sizes of the operands.
    However, these functions often directly invoke __builtin_foo() once they
    have performed the fortify check. The compiler can detect that the
    results of these functions are not used, and knows that they have no other
    side effects, and so can eliminate them as dead code.

    Why are only memchr, memcmp and strlen affected?
    ================================================

    Of string and string-like functions, kasan_test tests:

    * strchr -> not affected, no fortified version
    * strrchr -> likewise
    * strcmp -> likewise
    * strncmp -> likewise

    * strnlen -> not affected, the fortify source implementation calls the
    underlying strnlen implementation which is instrumented, not
    a builtin

    * strlen -> affected, the fortify source implementation calls a __builtin
    version which the compiler can determine is dead.

    * memchr -> likewise
    * memcmp -> likewise

    * memset -> not affected, the compiler knows that memset writes to its
    first argument and therefore is not dead.

    Why does this not affect the functions normally?
    ================================================

    In string.h, these functions are not marked as __pure, so the compiler
    cannot know that they do not have side effects. If relevant functions are
    marked as __pure in string.h, we see the following warnings and the
    functions are elided:

    lib/test_kasan.c: In function `kasan_memchr':
    lib/test_kasan.c:606:2: warning: statement with no effect [-Wunused-value]
    memchr(ptr, '1', size + 1);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~
    lib/test_kasan.c: In function `kasan_memcmp':
    lib/test_kasan.c:622:2: warning: statement with no effect [-Wunused-value]
    memcmp(ptr, arr, size+1);
    ^~~~~~~~~~~~~~~~~~~~~~~~
    lib/test_kasan.c: In function `kasan_strings':
    lib/test_kasan.c:645:2: warning: statement with no effect [-Wunused-value]
    strchr(ptr, '1');
    ^~~~~~~~~~~~~~~~
    ...

    This annotation would make sense to add and could be added at any point,
    so the behaviour of test_kasan.c should change.

    The fix
    =======

    Make all the functions that are pure write their results to a global,
    which makes them live. The strlen and memchr tests now pass.
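
    The pattern can be sketched like this (the global's name follows the
    style used in lib/test_kasan.c, but the snippet itself is illustrative):

```c
#include <assert.h>
#include <string.h>

/* Assigning the result of a side-effect-free call to a volatile global
 * keeps the call live: the compiler can no longer discard it as dead
 * code, so KASAN still sees the memory access. */
static volatile int kasan_int_result;

static void memcmp_stays_live(const char *a, const char *b, size_t n)
{
	/* Without this assignment, a fortified memcmp whose result is
	 * unused could be eliminated entirely. */
	kasan_int_result = memcmp(a, b, n);
}
```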

    The memcmp test still fails to trigger, which is addressed in the next
    patch.

    [dja@axtens.net: drop patch 3]
    Link: http://lkml.kernel.org/r/20200424145521.8203-2-dja@axtens.net
    Fixes: 0c96350a2d2f ("lib/test_kasan.c: add tests for several string/memory API functions")
    Signed-off-by: Daniel Axtens
    Signed-off-by: Andrew Morton
    Tested-by: David Gow
    Reviewed-by: Dmitry Vyukov
    Cc: Daniel Micay
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Link: http://lkml.kernel.org/r/20200423154503.5103-1-dja@axtens.net
    Link: http://lkml.kernel.org/r/20200423154503.5103-2-dja@axtens.net
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Daniel Axtens
     
  • [ Upstream commit b8215dce7dfd817ca38807f55165bf502146cd68 ]

    test_flow_dissector leaves a TAP device after it's finished, potentially
    interfering with other tests that will run after it. Fix it by closing the
    TAP descriptor on cleanup.

    Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200531082846.2117903-11-jakub@cloudflare.com
    Signed-off-by: Sasha Levin

    Jakub Sitnicki
     
  • [ Upstream commit e91de6afa81c10e9f855c5695eb9a53168d96b73 ]

    KTLS uses a stream parser to collect TLS messages and send them to
    the upper layer tls receive handler. This ensures the tls receiver
    has a full TLS header to parse when it is run. However, when a
    socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
    is enabled we end up with two stream parsers running on the same
    socket.

    The result is that both try to run on the same socket. First the KTLS
    stream parser runs and calls read_sock(), which calls tcp_read_sock(),
    which in turn calls tcp_rcv_skb(). This dequeues the skb from the
    sk_receive_queue. When this is done, the KTLS code calls the
    data_ready() callback, which, because we stacked KTLS on top of the
    bpf stream verdict program, has been replaced with
    sk_psock_start_strp(). This will in turn kick the stream parser again
    and eventually do the same thing KTLS did above, calling into
    tcp_rcv_skb() and dequeuing a skb from the sk_receive_queue.

    At this point the data stream is broken. Part of the stream was
    handled by the KTLS side and some other bytes may have been handled
    by the BPF side. Generally this results in either missing data
    or, more likely, a "Bad Message" complaint from the kTLS receive
    handler, as the BPF program steals some bytes meant to be in a
    TLS header and/or the TLS header length is no longer correct.

    We've already broken the idealized model where we can stack ULPs
    in any order with generic callbacks on the TX side to handle this.
    So in this patch we do the same thing but for RX side. We add
    a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
    program is running and add a tls_sw_has_ctx_rx() helper so BPF
    side can learn there is a TLS ULP on the socket.

    Then on BPF side we omit calling our stream parser to avoid
    breaking the data stream for the KTLS receiver. Then on the
    KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
    receiver is done with the packet but before it posts the
    msg to userspace. This gives us symmetry between the TX and
    RX halves and IMO makes it usable again. On the TX side we
    process packets in this order BPF -> TLS -> TCP and on
    the receive side in the reverse order TCP -> TLS -> BPF.

    Discovered while testing OpenSSL 3.0 Alpha2.0 release.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit ca2f5f21dbbd5e3a00cd3e97f728aa2ca0b2e011 ]

    We will need this block of code called from the tls context shortly,
    so let's refactor the redirect logic to make it easy to use. This also
    cleans up the switch stmt so we have fewer fallthrough cases.

    No logic changes are intended.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jakub Sitnicki
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit 1ea0f9120c8ce105ca181b070561df5cbd6bc049 ]

    The map_lookup_and_delete_elem() function should check for both FMODE_CAN_WRITE
    and FMODE_CAN_READ permissions because it returns a map element to user space.
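
    The permission check amounts to the following (a sketch only; the flag
    values and helper name are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FMODE_* bits. */
#define FMODE_CAN_READ  0x1u
#define FMODE_CAN_WRITE 0x2u

/* lookup-and-delete both returns an element (a read) and removes it
 * (a write), so the map fd must carry both permissions. */
static bool may_lookup_and_delete(unsigned int f_mode)
{
	return (f_mode & FMODE_CAN_READ) && (f_mode & FMODE_CAN_WRITE);
}
```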

    Fixes: bd513cd08f10 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall")
    Signed-off-by: Anton Protopopov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200527185700.14658-5-a.s.protopopov@gmail.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    Anton Protopopov
     
  • [ Upstream commit 601b05ca6edb0422bf6ce313fbfd55ec7bbbc0fd ]

    If the cpu_bufs are sparsely allocated, they are not all
    freed. These changes fix this.
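
    The shape of the fix can be sketched as follows (a simplified model;
    the struct and function names are illustrative, not libbpf's internals):

```c
#include <assert.h>
#include <stdlib.h>

/* Sparse per-CPU buffer array: slots for unused CPUs stay NULL. */
struct pb {
	void **cpu_bufs;
	size_t cpu_cnt;	/* number of slots, including NULL holes */
};

/* Cleanup must walk every slot, not just a populated prefix, so
 * sparsely allocated buffers are not leaked. free(NULL) is a no-op. */
static void pb_free_bufs(struct pb *pb)
{
	for (size_t i = 0; i < pb->cpu_cnt; i++) {
		free(pb->cpu_bufs[i]);
		pb->cpu_bufs[i] = NULL;
	}
}
```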

    Fixes: fb84b8224655 ("libbpf: add perf buffer API")
    Signed-off-by: Eelco Chaudron
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/159056888305.330763.9684536967379110349.stgit@ebuild
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    Eelco Chaudron
     
  • [ Upstream commit 7b91f1565fbfbe5a162d91f8a1f6c5580c2fc1d0 ]

    On the ASUS laptop UX325JA/UX425JA, most of the media keys are not
    working because the ASUS WMI driver fails to load. The following ACPI
    error leads to the failure of asus_wmi_evaluate_method:
    ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [IIA3] at bit offset/length 96/32 exceeds size of target Buffer (96 bits) (20200326/dsopcode-203)
    No Local Variables are initialized for Method [WMNB]
    ACPI Error: Aborting method \_SB.ATKD.WMNB due to previous error (AE_AML_BUFFER_LIMIT) (20200326/psparse-531)

    The DSDT for the WMNB part shows that 5 DWORD required for local
    variables and the 3rd variable IIA3 hit the buffer limit.

    Method (WMNB, 3, Serialized)
    { ..
    CreateDWordField (Arg2, Zero, IIA0)
    CreateDWordField (Arg2, 0x04, IIA1)
    CreateDWordField (Arg2, 0x08, IIA2)
    CreateDWordField (Arg2, 0x0C, IIA3)
    CreateDWordField (Arg2, 0x10, IIA4)
    Local0 = (Arg1 & 0xFFFFFFFF)
    If ((Local0 == 0x54494E49))
    ..
    }

    The limitation is determined by the input acpi_buffer size passed
    to the wmi_evaluate_method. Since the struct bios_args is the data
    structure used as input buffer by default for all ASUS WMI calls,
    the size needs to be expanded to fix the problem.
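
    A sketch of the size requirement (the layout and struct name here are
    illustrative; the real struct bios_args in the ASUS WMI driver has its
    own fields):

```c
#include <assert.h>
#include <stdint.h>

/* The WMNB method creates five DWORD fields (IIA0..IIA4) over the input
 * buffer, so the argument structure handed to the WMI call must cover
 * at least 5 * 4 = 20 bytes. */
struct bios_args_sketch {
	uint32_t arg0;
	uint32_t arg1;
	uint32_t arg2;
	uint32_t arg3;	/* IIA3: the field that overran the old buffer */
	uint32_t arg4;
};
```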

    Signed-off-by: Chris Chiu
    Reviewed-by: Hans de Goede
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Sasha Levin

    Chris Chiu
     
  • [ Upstream commit cfae58ed681c5fe0185db843013ecc71cd265ebf ]

    The HP Stream x360 11-p000nd no longer reports SW_TABLET_MODE state /
    events with recent kernels. This model reports a chassis-type of 10 /
    "Notebook", which is not on the recently introduced chassis-type
    whitelist.

    Commit de9647efeaa9 ("platform/x86: intel-vbtn: Only activate tablet mode
    switch on 2-in-1's") added a chassis-type whitelist and only listed 31 /
    "Convertible" as being capable of generating valid SW_TABLET_MODE events.

    Commit 1fac39fd0316 ("platform/x86: intel-vbtn: Also handle tablet-mode
    switch on "Detachable" and "Portable" chassis-types") extended the
    whitelist with chassis-types 8 / "Portable" and 32 / "Detachable".

    And now we need to extend the whitelist again with 10 / "Notebook"...

    The issue originally fixed by the whitelist is really an ACPI DSDT bug
    on the Dell XPS 9360, where it has a VGBS which reports it is in tablet
    mode even though it is not a 2-in-1 at all, but a regular laptop.

    So since this is a workaround for a DSDT issue on that specific model,
    instead of extending the whitelist over and over again, let's switch to
    a blacklist and only blacklist the chassis-type of the model for which
    the chassis-type check was added.

    Note this also fixes the current version of the code no longer checking
    if dmi_get_system_info(DMI_CHASSIS_TYPE) returns NULL.
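
    The blacklist approach can be sketched like this (the blacklisted value
    is a placeholder, not necessarily the chassis-type the real patch lists;
    the NULL check mirrors the fix noted above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder entry: the real patch blacklists the chassis-type reported
 * by the one model the original check worked around. */
static const char *const chassis_blacklist[] = { "9" };

static bool tablet_switch_supported(const char *chassis_type)
{
	/* The DMI lookup can return NULL; handle it explicitly. */
	if (!chassis_type)
		return false;

	for (size_t i = 0;
	     i < sizeof(chassis_blacklist) / sizeof(chassis_blacklist[0]); i++)
		if (!strcmp(chassis_type, chassis_blacklist[i]))
			return false;

	return true;	/* every other chassis-type is allowed */
}
```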

    Fixes: 1fac39fd0316 ("platform/x86: intel-vbtn: Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types")
    Cc: Mario Limonciello
    Signed-off-by: Hans de Goede
    Reviewed-by: Mario Limonciello
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Sasha Levin

    Hans de Goede
     
  • [ Upstream commit 8fe63eb757ac6e661a384cc760792080bdc738dc ]

    The HEBC method reports the capabilities of the 5 button array, but the
    HP Spectre X2 (2015) does not have this control method (the same was
    true for the Wacom MobileStudio Pro). Expand the previous DMI quirk by
    Alex Hung to also enable the 5 button array for this system.

    Signed-off-by: Nickolai Kozachenko
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Sasha Levin

    Nickolai Kozachenko
     
  • [ Upstream commit 5cdc45ed3948042f0d73c6fec5ee9b59e637d0d2 ]

    First of all, unsigned long can overflow a u32 value on a 64-bit
    machine. Second, simple_strtoul() doesn't check for overflow in the
    input.

    Convert simple_strtoul() to kstrtou32() to eliminate the above issues.
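
    A userspace sketch of the difference (parse_u32 mimics kstrtou32()'s
    range checking; the function itself is illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Unlike simple_strtoul() stored into a u32 (silent truncation on
 * 64-bit), reject values that do not fit as well as malformed input. */
static int parse_u32(const char *s, uint32_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || end == s || *end != '\0')
		return -EINVAL;
	if (v > UINT32_MAX)
		return -ERANGE;
	*out = (uint32_t)v;
	return 0;
}
```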

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Sasha Levin

    Andy Shevchenko
     
  • [ Upstream commit c343bf1ba5efcbf2266a1fe3baefec9cc82f867f ]

    kobject_init_and_add() takes reference even when it fails.
    If this function returns an error, kobject_put() must be called to
    properly clean up the memory associated with the object.

    Previous commit "b8eb718348b8" fixed a similar problem.

    Signed-off-by: Qiushi Wu
    [ rjw: Subject ]
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Sasha Levin

    Qiushi Wu
     
  • [ Upstream commit f0410bbf7d0fb80149e3b17d11d31f5b5197873e ]

    The DW APB SSI DMA part of the driver may need to perform the requested
    SPI transfer synchronously. In that case the dma_transfer() callback
    will return 0 as a marker that the SPI transfer is finished, so the
    SPI core doesn't need to wait and may proceed with the SPI message
    transfer pumping procedure. This will be needed to fix the problem
    where DMA transactions are finished but there is still data left in
    the SPI Tx/Rx FIFOs being sent/received. But for now, make dma_transfer
    return 1, like the normal dw_spi_transfer_one() method.

    Signed-off-by: Serge Semin
    Cc: Georgy Vlasov
    Cc: Ramil Zaripov
    Cc: Alexey Malahov
    Cc: Thomas Bogendoerfer
    Cc: Arnd Bergmann
    Cc: Andy Shevchenko
    Cc: Feng Tang
    Cc: Rob Herring
    Cc: linux-mips@vger.kernel.org
    Cc: devicetree@vger.kernel.org
    Link: https://lore.kernel.org/r/20200529131205.31838-3-Sergey.Semin@baikalelectronics.ru
    Signed-off-by: Mark Brown
    Signed-off-by: Sasha Levin

    Serge Semin
     
  • [ Upstream commit 1194be8c949b8190b2882ad8335a5d98aa50c735 ]

    According to the RM, bits [6:0] of the register ESDHC_TUNING_CTRL are
    TUNING_START_TAP, and bit [7] of this register disables the command
    CRC check for standard tuning. So fix it here.
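
    The layout translates to masks like these (macro names follow the
    driver's style but should be treated as illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define ESDHC_TUNING_START_TAP_MASK	0x7fu		/* bits [6:0] */
#define ESDHC_TUNING_CMD_CRC_CHECK_DIS	(1u << 7)	/* bit [7] */

/* Update only the start-tap field, preserving bit 7 so the CRC-check
 * disable setting is not clobbered. */
static uint32_t set_tuning_start_tap(uint32_t reg, uint32_t tap)
{
	reg &= ~ESDHC_TUNING_START_TAP_MASK;
	reg |= tap & ESDHC_TUNING_START_TAP_MASK;
	return reg;
}
```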

    Fixes: d87fc9663688 ("mmc: sdhci-esdhc-imx: support setting tuning start point")
    Signed-off-by: Haibo Chen
    Link: https://lore.kernel.org/r/1590488522-9292-1-git-send-email-haibo.chen@nxp.com
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Haibo Chen
     
  • [ Upstream commit f327236df2afc8c3c711e7e070f122c26974f4da ]

    When mvm is initialized, we allocate the aux station with an aux queue.
    We later free the station memory when the driver is stopped, but we
    never free the queue's memory, which causes a leak.

    Add a proper de-initialization of the station.

    Signed-off-by: Sharon
    Signed-off-by: Luca Coelho
    Link: https://lore.kernel.org/r/iwlwifi.20200529092401.0121c5be55e9.Id7516fbb3482131d0c9dfb51ff20b226617ddb49@changeid
    Signed-off-by: Sasha Levin

    Sharon
     
  • [ Upstream commit 3b70683fc4d68f5d915d9dc7e5ba72c732c7315c ]

    UBSAN reports this warning; fix it by adding an unsigned suffix.

    UBSAN: signed-integer-overflow in
    drivers/net/ethernet/intel/ixgbe/ixgbe_common.c:2246:26
    65535 * 65537 cannot be represented in type 'int'
    CPU: 21 PID: 7 Comm: kworker/u256:0 Not tainted 5.7.0-rc3-debug+ #39
    Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 03/27/2020
    Workqueue: ixgbe ixgbe_service_task [ixgbe]
    Call trace:
    dump_backtrace+0x0/0x3f0
    show_stack+0x28/0x38
    dump_stack+0x154/0x1e4
    ubsan_epilogue+0x18/0x60
    handle_overflow+0xf8/0x148
    __ubsan_handle_mul_overflow+0x34/0x48
    ixgbe_fc_enable_generic+0x4d0/0x590 [ixgbe]
    ixgbe_service_task+0xc20/0x1f78 [ixgbe]
    process_one_work+0x8f0/0xf18
    worker_thread+0x430/0x6d0
    kthread+0x218/0x238
    ret_from_fork+0x10/0x18
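
    A minimal model of the fix (the constants mirror the 65535 * 65537
    product from the report; the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* 65535 * 65537 with int operands overflows (undefined behaviour, which
 * UBSAN reports). With an unsigned suffix the multiplication happens in
 * unsigned arithmetic and the product fits in 32 bits. */
static uint32_t scale(uint32_t x)
{
	return x * 65537u;	/* the 'u' suffix is the fix */
}
```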

    Reported-by: Hulk Robot
    Signed-off-by: Xie XiuQi
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher
    Signed-off-by: Sasha Levin

    Xie XiuQi
     
  • [ Upstream commit bc3a024101ca497bea4c69be4054c32a5c349f1d ]

    If ice_init_interrupt_scheme fails, ice_probe will jump to clearing up
    the interrupts. This can lead to some static analysis tools such as the
    compiler sanitizers complaining about double free problems.

    Since ice_init_interrupt_scheme already unrolls internally on failure,
    there is no need to call ice_clear_interrupt_scheme when it fails. Add
    a new unroll label and use that instead.
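
    The unroll-label pattern looks like this (a generic sketch, not the ice
    driver's actual code):

```c
#include <assert.h>
#include <stdlib.h>

/* A step that unrolls internally on failure. */
static int init_step(void **res)
{
	*res = malloc(8);
	return *res ? 0 : -1;	/* on failure nothing is left allocated */
}

static int probe(void)
{
	void *res = NULL;
	int err;

	err = init_step(&res);
	if (err)
		goto err_no_unroll; /* init_step cleaned up after itself */

	/* ... further initialization would go here ... */

	free(res);
	return 0;

err_no_unroll:
	/* Deliberately no free(res) here: releasing what init_step already
	 * released would be the double free the patch avoids. */
	return err;
}
```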

    Signed-off-by: Jacob Keller
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher
    Signed-off-by: Sasha Levin

    Jacob Keller
     
  • [ Upstream commit 966244ccd2919e28f25555a77f204cd1c109cad8 ]

    Using a fixed 1s timeout for all commands (and data transfers) is a bit
    problematic.

    For some commands it means waiting longer than needed for the timer to
    expire, which may not be a big issue, but still. Other commands, like
    an erase (CMD38) that uses an R1B response, may require timeouts longer
    than 1s. In these cases, we may end up treating the command as if it
    failed, while it just needed some more time to complete successfully.

    Fix the problem by respecting the cmd->busy_timeout, which is provided by
    the mmc core.
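
    The resulting timeout selection reduces to something like this (names
    are illustrative; in the kernel the value comes from the mmc command
    structure):

```c
#include <assert.h>

/* Simplified stand-in for the relevant part of an mmc command. */
struct mmc_cmd {
	unsigned int busy_timeout_ms;	/* 0 means "none provided" */
};

/* Honor the per-command busy timeout from the mmc core; fall back to
 * the old fixed 1s only when no value was supplied. */
static unsigned int cmd_timeout_ms(const struct mmc_cmd *cmd)
{
	return cmd->busy_timeout_ms ? cmd->busy_timeout_ms : 1000;
}
```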

    Cc: Bruce Chang
    Cc: Harald Welte
    Signed-off-by: Ulf Hansson
    Link: https://lore.kernel.org/r/20200414161413.3036-17-ulf.hansson@linaro.org
    Signed-off-by: Sasha Levin

    Ulf Hansson
     
  • [ Upstream commit a389087ee9f195fcf2f31cd771e9ec5f02c16650 ]

    Using a fixed 1s timeout for all commands is a bit problematic.

    For some commands it means waiting longer than needed for the timeout to
    expire, which may not be a big issue, but still. Other commands, like
    an erase (CMD38) that uses an R1B response, may require timeouts longer
    than 1s. In these cases, we may end up treating the command as if it
    failed, while it just needed some more time to complete successfully.

    Fix the problem by respecting the cmd->busy_timeout, which is provided by
    the mmc core.

    Cc: Rui Miguel Silva
    Cc: Johan Hovold
    Cc: Alex Elder
    Cc: Greg Kroah-Hartman
    Cc: greybus-dev@lists.linaro.org
    Signed-off-by: Ulf Hansson
    Acked-by: Rui Miguel Silva
    Acked-by: Greg Kroah-Hartman
    Link: https://lore.kernel.org/r/20200414161413.3036-20-ulf.hansson@linaro.org
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Ulf Hansson
     
  • [ Upstream commit d863cb03fb2aac07f017b2a1d923cdbc35021280 ]

    sdhci-msm can support auto CMD12, so enable the
    SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk.

    Signed-off-by: Veerabhadrarao Badiganti
    Acked-by: Adrian Hunter
    Link: https://lore.kernel.org/r/1587363626-20413-3-git-send-email-vbadigan@codeaurora.org
    Signed-off-by: Ulf Hansson
    Signed-off-by: Sasha Levin

    Veerabhadrarao Badiganti
     
  • [ Upstream commit 86da9f736740eba602389908574dfbb0f517baa5 ]

    The problematic code piece in bcache_device_free() is,

    785 static void bcache_device_free(struct bcache_device *d)
    786 {
    787         struct gendisk *disk = d->disk;
    [snipped]
    799         if (disk) {
    800                 if (disk->flags & GENHD_FL_UP)
    801                         del_gendisk(disk);
    802
    803                 if (disk->queue)
    804                         blk_cleanup_queue(disk->queue);
    805
    806                 ida_simple_remove(&bcache_device_idx,
    807                                   first_minor_to_idx(disk->first_minor));
    808                 put_disk(disk);
    809         }
    [snipped]
    816 }

    At line 808, put_disk(disk) may cause the kobject refcount of 'disk'
    to underflow.

    Here is how to reproduce the issue,
    - Attach the backing device to a cache device and do random writes to
    make the cache dirty.
    - Stop the bcache device while the cache device has dirty data of the
    backing device.
    - Only register the backing device back, NOT the cache device.
    - The bcache device node /dev/bcache0 won't show up, because the backing
    device waits for the cache device to show up for the missing dirty
    data.
    - Now echo 1 into /sys/fs/bcache/pendings_cleanup, to stop the pending
    backing device.
    - After the pending backing device has stopped, use 'dmesg' to check the
    kernel messages; a use-after-free warning from KASAN reports that the
    refcount of the kobject linked to the 'disk' has underflowed.

    The refcount dropped at line 808 in the above code piece was added by
    add_disk(d->disk) in bch_cached_dev_run(). But in the above condition
    the cache device is not registered, bch_cached_dev_run() has no chance
    to be called, and the refcount is never added. Calling put_disk() on a
    gendisk kobject whose refcount was never added triggers an underflow
    warning.

    This patch checks whether GENHD_FL_UP is set in disk->flags; if it is
    not set, then the bcache device was not added, so don't call put_disk(),
    and the underflow issue is avoided.
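
    The guard can be sketched like this (the flag value is illustrative;
    in the kernel GENHD_FL_UP is set when the disk is added):

```c
#include <assert.h>
#include <stdbool.h>

#define GENHD_FL_UP 0x10u	/* illustrative value */

/* put_disk() drops the reference taken by add_disk(); calling it for a
 * disk that was never added underflows the kobject refcount, so only
 * proceed when GENHD_FL_UP says the disk really was added. */
static bool may_put_disk(unsigned int disk_flags)
{
	return (disk_flags & GENHD_FL_UP) != 0;
}
```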

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Coly Li