11 Oct, 2019

2 commits

  • ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
    but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
    another preemption point.

    Reported-and-tested-by: Bruce Ashfield
    Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Oleg Nesterov
    Acked-by: Roman Gushchin
    Signed-off-by: Tejun Heo

    Oleg Nesterov
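    A minimal sketch of the ordering this fix aims for in ptrace_stop(), assuming
    the resolution is to enter the frozen state while preemption is still
    disabled; the exact hunk may differ, and the siglock/task-state handling
    around it is elided:

        preempt_disable();
        /* ... set TASK_TRACED, drop siglock ... */
        cgroup_enter_frozen();          /* its spin_unlock() no longer adds a preemption point */
        preempt_enable_no_resched();    /* still no reschedule before ... */
        freezable_schedule();           /* ... the task actually sleeps */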
     
  • Pull xfs fixes from Darrick Wong:
    "A couple of small code cleanups and bug fixes for rounding errors,
    metadata logging errors, and an extra layer of safeguards against
    leaking memory contents.

    - Fix a rounding error in the fallocate code

    - Minor code cleanups

    - Make sure to zero memory buffers before formatting metadata blocks

    - Fix a few places where we forgot to log an inode metadata update

    - Remove broken error handling that tried to clean up after a failure
    but still got it wrong"

    * tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: move local to extent inode logging into bmap helper
    xfs: remove broken error handling on failed attr sf to leaf change
    xfs: log the inode on directory sf to block format change
    xfs: assure zeroed memory buffers for certain kmem allocations
    xfs: removed unused error variable from xchk_refcountbt_rec
    xfs: remove unused flags arg from xfs_get_aghdr_buf()
    xfs: Fix tail rounding in xfs_alloc_file_space()

    Linus Torvalds
     

10 Oct, 2019

10 commits

  • Pull crypto fixes from Herbert Xu:
    "Fix build issues in arm/aes-ce"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: arm/aes-ce - add dependency on AES library
    crypto: arm/aes-ce - build for v8 architecture explicitly

    Linus Torvalds
     
  • Pull btrfs fixes from David Sterba:
    "A few more stabitly fixes, one build warning fix.

    - fix inode allocation under NOFS context

    - fix leak in fiemap due to concurrent append writes

    - fix log-root tree updates

    - fix balance convert of single profile on 32bit architectures

    - silence false positive warning on old GCCs (code moved in rc1)"

    * tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: silence maybe-uninitialized warning in clone_range
    btrfs: fix uninitialized ret in ref-verify
    btrfs: allocate new inode in NOFS context
    btrfs: fix balance convert to single on 32-bit host CPUs
    btrfs: fix incorrect updating of log root tree
    Btrfs: fix memory leak due to concurrent append writes with fiemap

    Linus Torvalds
     
  • Pull dcache_readdir() fixes from Al Viro:
    "The couple of patches you'd been OK with; no hlist conversion yet, and
    cursors are still in the list of children"

    [ Al is referring to future work to avoid some nasty O(n**2) behavior
    with the readdir cursors when you have lots of concurrent readdirs.

    This is just a fix for a race with a trivial cleanup - Linus ]

    * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    libfs: take cursors out of list when moving past the end of directory
    Fix the locking in dcache_readdir() and friends

    Linus Torvalds
     
  • Pull mount fixes from Al Viro:
    "A couple of regressions from the mount series"

    * 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: add missing blkdev_put() in get_tree_bdev()
    shmem: fix LSM options parsing

    Linus Torvalds
     
  • At the end of the v5.3 upstream kernel development cycle, Simon stepped
    down from his role as Renesas SoC maintainer.

    Remove his maintainership, git repository, and branch from the
    MAINTAINERS file, and add an entry to the CREDITS file to honor his
    work.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • that eliminates the last place where we accessed the tail of ->d_subdirs

    Signed-off-by: Al Viro

    Al Viro
     
  • There are a couple of missing blkdev_put() calls in get_tree_bdev().

    Signed-off-by: Al Viro

    Ian Kent
     
  • ->parse_monolithic() there forgets to call security_sb_eat_lsm_opts()

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull rdma fixes from Jason Gunthorpe:
    "The usual collection of driver bug fixes, and a few regressions from
    the merge window. Nothing particularly worrisome.

    - Various missed memory frees and error unwind bugs

    - Fix regressions in a few iwarp drivers from 5.4 patches

    - A few regressions added in past kernels

    - Squash a number of races in mlx5 ODP code"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    RDMA/mlx5: Add missing synchronize_srcu() for MW cases
    RDMA/mlx5: Put live in the correct place for ODP MRs
    RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu
    RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages()
    RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
    RDMA/mlx5: Do not allow rereg of a ODP MR
    IB/core: Fix wrong iterating on ports
    RDMA/nldev: Reshuffle the code to avoid need to rebind QP in error path
    RDMA/cxgb4: Do not dma memory off of the stack
    RDMA/cm: Fix memory leak in cm_add/remove_one
    RDMA/core: Fix an error handling path in 'res_get_common_doit()'
    RDMA/i40iw: Associate ibdev to netdev before IB device registration
    RDMA/iwcm: Fix a lock inversion issue
    RDMA/iw_cxgb4: fix SRQ access from dump_qp()
    RDMA/hfi1: Prevent memory leak in sdma_init
    RDMA/core: Fix use after free and refcnt leak on ndev in_device in iwarp_query_port
    RDMA/siw: Fix serialization issue in write_space()
    RDMA/vmw_pvrdma: Free SRQ only once

    Linus Torvalds
     
  • Pull arm64 fixes from Will Deacon:
    "A larger-than-usual batch of arm64 fixes for -rc3.

    The bulk of the fixes are dealing with a bunch of issues with the
    build system from the compat vDSO, which unfortunately led to some
    significant Makefile rework to manage the horrible combinations of
    toolchains that we can end up needing to drive simultaneously.

    We came close to disabling the thing entirely, but Vincenzo was quick
    to spin up some patches and I ended up picking up most of the bits
    that were left [*]. Future work will look at disentangling the header
    files properly.

    Other than that, we have some important fixes all over, including one
    papering over the miscompilation fallout from forcing
    CONFIG_OPTIMIZE_INLINING=y, which I'm still unhappy about. Harumph.

    We've still got a couple of open issues, so I'm expecting to have some
    more fixes later this cycle.

    Summary:

    - Numerous fixes to the compat vDSO build system, especially when
    combining gcc and clang

    - Fix parsing of PAR_EL1 in spurious kernel fault detection

    - Partial workaround for Neoverse-N1 erratum #1542419

    - Fix IRQ priority masking on entry from compat syscalls

    - Fix advertisement of FRINT HWCAP to userspace

    - Attempt to workaround inlining breakage with '__always_inline'

    - Fix accidental freeing of parent SVE state on fork() error path

    - Add some missing NULL pointer checks in instruction emulation init

    - Some formatting and comment fixes"

    [*] Will's final fixes were

    Reviewed-by: Vincenzo Frascino
    Tested-by: Vincenzo Frascino

    but they were already in linux-next by then and he didn't rebase
    just to add those.

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (21 commits)
    arm64: armv8_deprecated: Checking return value for memory allocation
    arm64: Kconfig: Make CONFIG_COMPAT_VDSO a proper Kconfig option
    arm64: vdso32: Rename COMPATCC to CC_COMPAT
    arm64: vdso32: Pass '--target' option to clang via VDSO_CAFLAGS
    arm64: vdso32: Don't use KBUILD_CPPFLAGS unconditionally
    arm64: vdso32: Move definition of COMPATCC into vdso32/Makefile
    arm64: Default to building compat vDSO with clang when CONFIG_CC_IS_CLANG
    lib: vdso: Remove CROSS_COMPILE_COMPAT_VDSO
    arm64: vdso32: Remove jump label config option in Makefile
    arm64: vdso32: Detect binutils support for dmb ishld
    arm64: vdso: Remove stale files from old assembly implementation
    arm64: vdso32: Fix broken compat vDSO build warnings
    arm64: mm: fix spurious fault detection
    arm64: ftrace: Ensure synchronisation in PLT setup for Neoverse-N1 #1542419
    arm64: Fix incorrect irqflag restore for priority masking for compat
    arm64: mm: avoid virt_to_phys(init_mm.pgd)
    arm64: cpufeature: Effectively expose FRINT capability to userspace
    arm64: Mark functions using explicit register variables as '__always_inline'
    docs: arm64: Fix indentation and doc formatting
    arm64/sve: Fix wrong free for task->thread.sve_state
    ...

    Linus Torvalds
     

09 Oct, 2019

9 commits

  • The callers of xfs_bmap_local_to_extents_empty() log the inode
    external to the function, yet this function is where the on-disk
    format value is updated. Push the inode logging down into the
    function itself to help prevent future mistakes.

    Note that internal bmap callers track the inode logging flags
    independently and thus may log the inode core twice due to this
    change. This is harmless, so leave this code around for consistency
    with the other attr fork conversion functions.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
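    A hedged sketch of the idea rather than the literal patch: the helper that
    flips the fork to extents format now logs the inode core itself, so callers
    can no longer forget to (the real function takes more arguments and performs
    the actual format switch):

        void
        xfs_bmap_local_to_extents_empty(
            struct xfs_trans        *tp,
            struct xfs_inode        *ip,
            int                     whichfork)
        {
            /* ... switch the fork from local to extents format ... */

            /* log the on-disk format change in the same place it is made */
            xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
        }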
     
  • xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
    together after a failed attempt to convert from shortform to leaf
    format. While this code reallocates and copies back the shortform
    attr fork data, it never resets the inode format field back to local
    format. Further, now that the inode is properly logged after the
    initial switch from local format, any error that triggers the
    recovery code will eventually abort the transaction and shutdown the
    fs. Therefore, remove the broken and unnecessary error handling
    code.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • When a directory changes from shortform (sf) to block format, the sf
    format is copied to a temporary buffer, the inode format is modified
    and the updated format filled with the dentries from the temporary
    buffer. If the inode format is modified and attempt to grow the
    inode fails (due to I/O error, for example), it is possible to
    return an error while leaving the directory in an inconsistent state
    and with an otherwise clean transaction. This results in corruption
    of the associated directory and leads to xfs_dabuf_map() errors as
    subsequent lookups cannot accurately determine the format of the
    directory. This problem is reproduced occasionally by generic/475.

    The fundamental problem is that xfs_dir2_sf_to_block() changes the
    on-disk inode format without logging the inode. The inode is
    eventually logged by the bmapi layer in the common case, but error
    checking introduces the possibility of failing the high level
    request before this happens.

    Update both of the dir2 and attr callers of
    xfs_bmap_local_to_extents_empty() to log the inode core as
    consistent with the bmap local to extent format change codepath.
    This ensures that any subsequent errors after the format has changed
    cause the transaction to abort.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • …it/j.anaszewski/linux-leds

    Pull LED fixes from Jacek Anaszewski:

    - fix a leftover from an earlier stage of development in the documentation
    of the recently added led_compose_name(), and fix an old mistake in the
    documentation of the led_set_brightness_sync() parameter name.

    - MAINTAINERS: add pointer to Pavel Machek's linux-leds.git tree.
    Pavel is going to take over LED tree maintainership from myself.

    * tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
    Add my linux-leds branch to MAINTAINERS
    leds: core: Fix leds.h structure documentation

    Linus Torvalds
     
  • Add pointer to my git tree to MAINTAINERS. I'd like to maintain
    linux-leds for-next branch for 5.5.

    Signed-off-by: Pavel Machek
    Signed-off-by: Jacek Anaszewski

    Pavel Machek
     
  • Update the leds.h structure documentation to define the
    correct arguments.

    Signed-off-by: Dan Murphy
    Signed-off-by: Jacek Anaszewski

    Dan Murphy
     
  • Pull GPIO fixes from Linus Walleij:

    - don't clear FLAG_IS_OUT when emulating open drain/source in gpiolib

    - fix up the usage of nonexclusive GPIO descriptors from device trees

    - fix the incorrect IEC offset when toggling trigger edge in the
    Spreadtrum driver

    - use the correct unit for debounce settings in the MAX77620 driver

    * tag 'gpio-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
    gpio: max77620: Use correct unit for debounce times
    gpio: eic: sprd: Fix the incorrect EIC offset when toggling
    gpio: fix getting nonexclusive gpiods from DT
    gpiolib: don't clear FLAG_IS_OUT when emulating open-drain/open-source

    Linus Torvalds
     
  • Pull selinux fix from Paul Moore:
    "One patch to ensure we don't copy bad memory up into userspace"

    * tag 'selinux-pr-20191007' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    selinux: fix context string corruption in convert_context()

    Linus Torvalds
     
  • …/git/shuah/linux-kselftest

    Pull Kselftest fixes from Shuah Khan:
    "Fixes for existing tests and the framework.

    Cristian Marussi's patches add the ability to skip targets (tests) and
    exclude tests that didn't build from run-list. These patches improve
    the Kselftest results. Ability to skip targets helps avoid running
    tests that aren't supported in certain environments. As an example,
    bpf tests from mainline aren't supported on stable kernels and have
    dependency on bleeding edge llvm. Being able to skip bpf on systems
    that can't meet this llvm dependency will be helpful.

    Kselftest can be built and installed from the main Makefile. This
    change helps simplify Kselftest use cases, addressing requests from
    users.

    Kees Cook added per test timeout support to limit individual test
    run-time"

    * tag 'linux-kselftest-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
    selftests: watchdog: Add command line option to show watchdog_info
    selftests: watchdog: Validate optional file argument
    selftests/kselftest/runner.sh: Add 45 second timeout per test
    kselftest: exclude failed TARGETS from runlist
    kselftest: add capability to skip chosen TARGETS
    selftests: Add kselftest-all and kselftest-install targets

    Linus Torvalds
     

08 Oct, 2019

19 commits

  • There is no return-value checking when using kzalloc() and kcalloc() for
    memory allocation, so add it; a sketch of the pattern follows this entry.

    Signed-off-by: Yunfeng Ye
    Signed-off-by: Will Deacon

    Yunfeng Ye
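    A minimal, self-contained sketch of the pattern being enforced (the function
    name and allocation size are illustrative, not taken from the patch):

        #include <linux/slab.h>

        static int example_init(void)
        {
            u32 *table = kcalloc(16, sizeof(*table), GFP_KERNEL);

            if (!table)             /* the check this patch adds */
                return -ENOMEM;

            /* ... use table ... */
            kfree(table);
            return 0;
        }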
     
  • GCC throws warning message as below:

    ‘clone_src_i_size’ may be used uninitialized in this function
    [-Wmaybe-uninitialized]
    #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
    ^
    fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
    u64 clone_src_i_size;
    ^
    The clone_src_i_size is only used as call-by-reference
    in a call to get_inode_info().

    Silence the warning by initializing clone_src_i_size to 0.

    Note that the warning is a false positive, reported by older versions
    of GCC (e.g. 7.x) but not by e.g. 9.x. As there have been numerous reports,
    the patch is applied. Setting clone_src_i_size to 0 does not otherwise make
    sense and has no effect should the code change in the future; a sketch of
    the silencing pattern follows this entry.

    Signed-off-by: Austin Kim
    Reviewed-by: David Sterba
    [ add note ]
    Signed-off-by: David Sterba

    Austin Kim
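    A self-contained userspace illustration of this warning class, assuming a
    hypothetical helper standing in for get_inode_info(); older GCC cannot see
    that the helper always writes through the pointer, so the explicit
    initializer quiets it:

        #include <stdint.h>

        /* stands in for get_inode_info(): always writes *out on success */
        static int lookup_size(uint64_t *out)
        {
            *out = 4096;
            return 0;
        }

        int demo(void)
        {
            uint64_t size = 0;  /* init only to silence -Wmaybe-uninitialized on GCC 7.x */
            int ret = lookup_size(&size);

            return ret ? ret : (int)(size / 4096);
        }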
     
  • Merge misc fixes from Andrew Morton:
    "The usual shower of hotfixes.

    Chris's memcg patches aren't actually fixes - they're mature but a few
    niggling review issues were late to arrive.

    The ocfs2 fixes are quite old - those took some time to get reviewer
    attention.

    Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
    mm/slab-generic"

    * emailed patches from Andrew Morton :
    mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
    mm, sl[ou]b: improve memory accounting
    mm, memcg: make scan aggression always exclude protection
    mm, memcg: make memory.emin the baseline for utilisation determination
    mm, memcg: proportional memory.{low,min} reclaim
    mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
    mm/page_alloc.c: fix a crash in free_pages_prepare()
    mm/z3fold.c: claim page in the beginning of free
    kernel/sysctl.c: do not override max_threads provided by userspace
    memcg: only record foreign writebacks with dirty pages when memcg is not disabled
    mm: fix -Wmissing-prototypes warnings
    writeback: fix use-after-free in finish_writeback_work()
    mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
    panic: ensure preemption is disabled during panic()
    fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
    fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
    fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
    ocfs2: clear zero in unaligned direct IO

    Linus Torvalds
     
  • In most configurations, kmalloc() happens to return naturally aligned
    (i.e. aligned to the block size itself) blocks for power of two sizes.

    That means some kmalloc() users might unknowingly rely on that
    alignment, until stuff breaks when the kernel is built with e.g.
    CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
    developers have to devise workaround such as own kmem caches with
    specified alignment [1], which is not always practical, as recently
    evidenced in [2].

    The topic has been discussed at LSF/MM 2019 [3]. Adding a
    'kmalloc_aligned()' variant would not help with code unknowingly relying
    on the implicit alignment. For slab implementations it would either
    require creating more kmalloc caches, or allocate a larger size and only
    give back part of it. That would be wasteful, especially with a generic
    alignment parameter (in contrast with a fixed alignment to size).

    Ideally we should provide to mm users what they need without difficult
    workarounds or own reimplementations, so let's make the kmalloc()
    alignment to size explicitly guaranteed for power-of-two sizes under all
    configurations. What does this mean for the three available allocators?

    * SLAB object layout happens to be mostly unchanged by the patch. The
    implicitly provided alignment could be compromised with
    CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
    caches with alignment larger than unsigned long long. Practically on at
    least x86 this includes kmalloc caches as they use cache line alignment,
    which is larger than that. Still, this patch ensures alignment on all
    arches and cache sizes.

    * SLUB layout is also unchanged unless redzoning is enabled through
    CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
    With this patch, explicit alignment is guaranteed with redzoning as
    well. This will result in more memory being wasted, but that should be
    acceptable in a debugging scenario.

    * SLOB has no implicit alignment so this patch adds it explicitly for
    kmalloc(). The potential downside is increased fragmentation. While
    pathological allocation scenarios are certainly possible, in my testing,
    after booting a x86_64 kernel+userspace with virtme, around 16MB memory
    was consumed by slab pages both before and after the patch, with
    difference in the noise.

    [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
    [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
    [3] https://lwn.net/Articles/787740/

    [akpm@linux-foundation.org: documentation fixlet, per Matthew]
    Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Christoph Hellwig
    Cc: David Sterba
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
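    A hedged sketch of what the new guarantee means in practice: for a
    power-of-two size, the returned block is aligned to that size on all three
    allocators (the WARN_ON below is purely illustrative):

        #include <linux/slab.h>
        #include <linux/bug.h>

        static void kmalloc_alignment_demo(void)
        {
            void *p = kmalloc(512, GFP_KERNEL);         /* power-of-two size */

            if (p) {
                WARN_ON((unsigned long)p & (512 - 1));  /* must not fire with this patch */
                kfree(p);
            }
        }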
     
  • Patch series "guarantee natural alignment for kmalloc()", v2.

    This patch (of 2):

    SLOB currently doesn't account its pages at all, so in /proc/meminfo the
    Slab field shows zero. Modifying a counter on page allocation and
    freeing should be acceptable even for the small system scenarios SLOB is
    intended for. Since reclaimable caches are not separated in SLOB,
    account everything as unreclaimable.

    SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
    larger than order-1 page, that are passed directly to the page
    allocator. As they also don't appear in /proc/slabinfo, it might look
    like a memory leak. For consistency, account them as well. (SLAB
    doesn't actually use page allocator directly, so no change there).

    Ideally SLOB and SLUB would be handled in separate patches, but due to
    the shared kmalloc_order() function and different kfree()
    implementations, it's easier to patch both at once to prevent
    inconsistencies.

    Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This patch is an incremental improvement on the existing
    memory.{low,min} relative reclaim work to base its scan pressure
    calculations on how much protection is available compared to the current
    usage, rather than how much the current usage is over some protection
    threshold.

    This change doesn't change the experience for the user in the normal
    case too much. One benefit is that it replaces the (somewhat arbitrary)
    100% cutoff with an indefinite slope, which makes it easier to ballpark
    a memory.low value.

    As well as this, the old methodology doesn't quite apply generically to
    machines with varying amounts of physical memory. Let's say we have a
    top level cgroup, workload.slice, and another top level cgroup,
    system-management.slice. We want to roughly give 12G to
    system-management.slice, so on a 32GB machine we set memory.low to 20GB
    in workload.slice, and on a 64GB machine we set memory.low to 52GB.
    However, because these are relative amounts to the total machine size,
    while the amount of memory we want to generally be willing to yield to
    system.slice is absolute (12G), we end up putting more pressure on
    system.slice just because we have a larger machine and a larger workload
    to fill it, which seems fairly unintuitive. With this new behaviour, we
    don't end up with this unintended side effect.

    Previously the way that memory.low protection works is that if you are
    50% over a certain baseline, you get 50% of your normal scan pressure.
    This is certainly better than the previous cliff-edge behaviour, but it
    can be improved even further by always considering memory under the
    currently enforced protection threshold to be out of bounds. This means
    that we can set relatively low memory.low thresholds for variable or
    bursty workloads while still getting a reasonable level of protection,
    whereas with the previous version we may still trivially hit the 100%
    clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
    one is more concretely based on the currently enforced protection
    threshold, which is likely easier to reason about.

    There is also a subtle issue with the way that proportional reclaim
    worked previously -- it promotes having no memory.low, since it makes
    pressure higher during low reclaim. This happens because we base our
    scan pressure modulation on how far memory.current is between memory.min
    and memory.low, but if memory.low is unset, we only use the overage
    method. In most cromulent configurations, this then means that we end
    up with *more* pressure than with no memory.low at all when we're in low
    reclaim, which is not really very usable or expected.

    With this patch, memory.low and memory.min affect reclaim pressure in a
    more understandable and composable way. For example, from a user
    standpoint, "protected" memory now remains untouchable from a reclaim
    aggression standpoint, and users can also have more confidence that
    bursty workloads will still receive some amount of guaranteed
    protection.

    Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
    Signed-off-by: Chris Down
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Roman points out that when we do the low reclaim pass, we scale the
    reclaim pressure relative to position between 0 and the maximum
    protection threshold.

    However, if the maximum protection is based on memory.elow, and
    memory.emin is above zero, this means we still may get binary behaviour
    on second-pass low reclaim. This is because we scale starting at 0, not
    starting at memory.emin, and since we don't scan at all below emin, we
    end up with cliff behaviour.

    This should be a fairly uncommon case since usually we don't go into the
    second pass, but it makes sense to scale our low reclaim pressure
    starting at emin.

    You can test this by catting two large sparse files, one in a cgroup
    with emin set to some moderate size compared to physical RAM, and
    another cgroup without any emin. In both cgroups, set an elow larger
    than 50% of physical RAM. The one with emin will have less page
    scanning, as reclaim pressure is lower.

    Rebase on top of and apply the same idea as what was applied to handle
    cgroup_memory=disable properly for the original proportional patch
    http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
    memcg: Handle cgroup_disable=memory when getting memcg protection").

    Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
    Signed-off-by: Chris Down
    Suggested-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • cgroup v2 introduces two memory protection thresholds: memory.low
    (best-effort) and memory.min (hard protection). While they generally do
    what they say on the tin, there is a limitation in their implementation
    that makes them difficult to use effectively: that cliff behaviour often
    manifests when they become eligible for reclaim. This patch implements
    more intuitive and usable behaviour, where we gradually mount more
    reclaim pressure as cgroups further and further exceed their protection
    thresholds.

    This cliff edge behaviour happens because we only choose whether or not
    to reclaim based on whether the memcg is within its protection limits
    (see the use of mem_cgroup_protected in shrink_node), but we don't vary
    our reclaim behaviour based on this information. Imagine the following
    timeline, with the numbers the lruvec size in this zone:

    1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
    2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
    3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
    scanned. (?!)

    * Of course, we won't usually scan all available pages in the zone even
    without this patch because of scan control priority, over-reclaim
    protection, etc. However, as shown by the tests at the end, these
    techniques don't sufficiently throttle such an extreme change in input,
    so cliff-like behaviour isn't really averted by their existence alone.

    Here's an example of how this plays out in practice. At Facebook, we are
    trying to protect various workloads from "system" software, like
    configuration management tools, metric collectors, etc (see this[0] case
    study). In order to find a suitable memory.low value, we start by
    determining the expected memory range within which the workload will be
    comfortable operating. This isn't an exact science -- memory usage deemed
    "comfortable" will vary over time due to user behaviour, differences in
    composition of work, etc, etc. As such we need to ballpark memory.low,
    but doing this is currently problematic:

    1. If we end up setting it too low for the workload, it won't have
    *any* effect (see discussion above). The group will receive the full
    weight of reclaim and won't have any priority while competing with the
    less important system software, as if we had no memory.low configured
    at all.

    2. Because of this behaviour, we end up erring on the side of setting
    it too high, such that the comfort range is reliably covered. However,
    protected memory is completely unavailable to the rest of the system,
    so we might cause undue memory and IO pressure there when we *know* we
    have some elasticity in the workload.

    3. Even if we get the value totally right, smack in the middle of the
    comfort zone, we get extreme jumps between no pressure and full
    pressure that cause unpredictable pressure spikes in the workload due
    to the current binary reclaim behaviour.

    With this patch, we can set it to our ballpark estimation without too much
    worry. Any undesirable behaviour, such as too much or too little reclaim
    pressure on the workload or system will be proportional to how far our
    estimation is off. This means we can set memory.low much more
    conservatively and thus waste less resources *without* the risk of the
    workload falling off a cliff if we overshoot.

    As a more abstract technical description, this unintuitive behaviour
    results in having to give high-priority workloads a large protection
    buffer on top of their expected usage to function reliably, as otherwise
    we have abrupt periods of dramatically increased memory pressure which
    hamper performance. Having to set these thresholds so high wastes
    resources and generally works against the principle of work conservation.
    In addition, having proportional memory reclaim behaviour has other
    benefits. Most notably, before this patch it's basically mandatory to set
    memory.low to a higher than desirable value because otherwise as soon as
    you exceed memory.low, all protection is lost, and all pages are eligible
    to scan again. By contrast, having a gradual ramp in reclaim pressure
    means that you now still get some protection when thresholds are exceeded,
    which means that one can now be more comfortable setting memory.low to
    lower values without worrying that all protection will be lost. This is
    important because workingset size is really hard to know exactly,
    especially with variable workloads, so at least getting *some* protection
    if your workingset size grows larger than you expect increases user
    confidence in setting memory.low without a huge buffer on top being
    needed.

    Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
    assistance in thinking about how to make this work better.

    In testing these changes, I intended to verify that:

    1. Changes in page scanning become gradual and proportional instead of
    binary.

    To test this, I experimented stepping further and further down
    memory.low protection on a workload that floats around 19G workingset
    when under memory.low protection, watching page scan rates for the
    workload cgroup:

    +------------+-----------------+--------------------+--------------+
    | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
    +------------+-----------------+--------------------+--------------+
    |        21G |               0 |                  0 |          N/A |
    |        17G |             867 |               3799 |          23% |
    |        12G |            1203 |               3543 |          34% |
    |         8G |            2534 |               3979 |          64% |
    |         4G |            3980 |               4147 |          96% |
    |          0 |            3799 |               3980 |          95% |
    +------------+-----------------+--------------------+--------------+

    As you can see, the test kernel (with a kernel containing this
    patch) ramps up page scanning significantly more gradually than the
    control kernel (without this patch).

    2. More gradual ramp up in reclaim aggression doesn't result in
    premature OOMs.

    To test this, I wrote a script that slowly increments the number of
    pages held by stress(1)'s --vm-keep mode until a production system
    entered severe overall memory contention. This script runs in a highly
    protected slice taking up the majority of available system memory.
    Watching vmstat revealed that page scanning continued essentially
    nominally between test and control, without causing forward reclaim
    progress to become arrested.

    [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

    [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
    [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
    Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
    Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
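    A simplified sketch of the proportional idea only, not the mainline
    get_scan_count() code: instead of the old all-or-nothing check, the scan
    target shrinks in proportion to how much of the cgroup's current usage is
    still covered by its memory.low/min protection:

        static unsigned long proportional_scan(unsigned long lruvec_size,
                                               unsigned long usage,
                                               unsigned long protection)
        {
            if (!protection)
                return lruvec_size;     /* unprotected: full pressure */
            if (usage <= protection)
                return 0;               /* fully protected: no scanning */

            /* pressure grows with the unprotected fraction of usage */
            return lruvec_size - (lruvec_size * protection) / usage;
        }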
     
  • The "mode" and "level" variables are enums and in this context GCC will
    treat them as unsigned ints so the error handling is never triggered.

    I also removed the bogus initializer because it isn't required any more
    and it's sort of confusing.

    [akpm@linux-foundation.org: reduce implicit and explicit typecasting]
    [akpm@linux-foundation.org: fix return value, add comment, per Matthew]
    Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda
    Fixes: 3cadfa2b9497 ("mm/vmpressure.c: convert to use match_string() helper")
    Signed-off-by: Dan Carpenter
    Reviewed-by: Andy Shevchenko
    Acked-by: David Rientjes
    Reviewed-by: Matthew Wilcox
    Cc: Greg Kroah-Hartman
    Cc: Thomas Gleixner
    Cc: Enrico Weigelt
    Cc: Kate Stewart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
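    A self-contained illustration of the bug class, with hypothetical names: GCC
    picks an unsigned underlying type for an enum whose enumerators are all
    non-negative, so once a negative error code is stored in the enum variable,
    the "< 0" test can never be true:

        #include <stdio.h>

        enum demo_level { DEMO_LOW, DEMO_MEDIUM, DEMO_CRITICAL };

        static int find_level(const char *s)
        {
            (void)s;
            return -1;          /* stands in for a failed match_string() lookup */
        }

        int main(void)
        {
            enum demo_level level = find_level("bogus");

            if (level < 0)      /* always false: the enum is unsigned here */
                puts("error path taken");
            else
                puts("error silently ignored");
            return 0;
        }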
     
  • On architectures like s390, arch_free_page() could mark the page unused
    (set_page_unused()) and any access later would trigger a kernel panic.
    Fix it by moving arch_free_page() after all possible accessing calls.

    Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
    R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
    000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
    00000000275cca48 0000000000000100 0000000000000008 000003d080010000
    00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
    Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
    0000000026c2b962: e32014000036 pfd 2,1024(%r1)
    #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
    >0000000026c2b96e: 41101100 la %r1,256(%r1)
    0000000026c2b972: a737fff8 brctg %r3,26c2b962
    0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
    0000000026c2b97c: e31003400004 lg %r1,832
    0000000026c2b982: ebff1430016a asi 5168(%r1),-1
    Call Trace:
    __free_pages_ok+0x16a/0x5d8)
    memblock_free_all+0x206/0x290
    mem_init+0x58/0x120
    start_kernel+0x2b0/0x570
    startup_continue+0x6a/0xc0
    INFO: lockdep is turned off.
    Last Breaking-Event-Address:
    __free_pages_ok+0x372/0x5d8
    Kernel panic - not syncing: Fatal exception: panic_on_oops
    00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C

    In the past, only kernel_poison_pages() would trigger this but it needs
    "page_poison=on" kernel cmdline, and I suspect nobody tested that on
    s390. Recently, kernel_init_free_pages() (commit 6471384af2a6 ("mm:
    security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
    was added and could trigger this as well.

    [akpm@linux-foundation.org: add comment]
    Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
    Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Signed-off-by: Qian Cai
    Reviewed-by: Heiko Carstens
    Acked-by: Christian Borntraeger
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vasily Gorbik
    Cc: Alexander Duyck
    Cc: [5.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
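    A hedged sketch of the reordering in the free path (surrounding checks and
    the enclosing function elided): everything that still writes to the page,
    the init_on_free zeroing and the poisoning, must run before arch_free_page(),
    since on s390 the latter can make the page inaccessible:

        kernel_init_free_pages(page, 1 << order);   /* init_on_free=1 zeroing */
        kernel_poison_pages(page, 1 << order, 0);   /* page_poison=on */
        arch_free_page(page, order);    /* moved last: nothing may touch the page after this */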
     
  • There's a really hard to reproduce race in z3fold between z3fold_free()
    and z3fold_reclaim_page(). z3fold_reclaim_page() can claim the page
    after z3fold_free() has checked if the page was claimed and
    z3fold_free() will then schedule this page for compaction which may in
    turn lead to random page faults (since that page would have been
    reclaimed by then).

    Fix that by claiming the page at the beginning of z3fold_free() and not
    forgetting to clear the claim at the end.

    [vitalywool@gmail.com: v2]
    Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell
    Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com
    Signed-off-by: Vitaly Wool
    Reported-by: Markus Linnala
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Cc: Henry Burns
    Cc: Shakeel Butt
    Cc: Markus Linnala
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe
    limits") because the patch is causing a regression to any workload which
    needs to override the auto-tuning of the limit provided by kernel.

    set_max_threads is implementing a boot time guesstimate to provide a
    sensible limit of the concurrently running threads so that runaways will
    not deplete all the memory. This is a good thing in general but there
    are workloads which might need to increase this limit for an application
    to run (reportedly WebSphere MQ is affected) and that is simply not
    possible after the mentioned change. It is also very dubious to
    override an admin decision by an estimation that doesn't have any direct
    relation to correctness of the kernel operation.

    Fix this by dropping set_max_threads from sysctl_max_threads so any
    value is accepted as long as it fits into MAX_THREADS which is important
    to check because allowing more threads could break internal robust futex
    restriction. While at it, do not use MIN_THREADS as the lower boundary
    because it is also only a heuristic for automatic estimation and admin
    might have a good reason to stop new threads to be created even when
    below this limit.

    This became more severe when we switched x86 from 8k to 16k kernel
    stacks. Since 6538b8ea886e ("x86_64: expand kernel stack to
    16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
    value.

    In the particular case

    3.12
    kernel.threads-max = 515561

    4.4
    kernel.threads-max = 200000

    Neither of the two values is really insane on a 32GB machine.

    I am not sure we want/need to tune the max_thread value further. If
    anything the tuning should be removed altogether if proven not useful in
    general. But we definitely need a way to override this auto-tuning.

    Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
    Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits")
    Signed-off-by: Michal Hocko
    Reviewed-by: "Eric W. Biederman"
    Cc: Heinrich Schuchardt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory'
    for saving memory. Now the kdump kernel will always panic when dumping vmcore
    to local disk:

    BUG: kernel NULL pointer dereference, address: 0000000000000ab8
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018
    RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140
    Call Trace:
    __set_page_dirty+0x52/0xc0
    iomap_set_page_dirty+0x50/0x90
    iomap_write_end+0x6e/0x270
    iomap_write_actor+0xce/0x170
    iomap_apply+0xba/0x11e
    iomap_file_buffered_write+0x62/0x90
    xfs_file_buffered_aio_write+0xca/0x320 [xfs]
    new_sync_write+0x12d/0x1d0
    vfs_write+0xa5/0x1a0
    ksys_write+0x59/0xd0
    do_syscall_64+0x59/0x1e0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    And this will corrupt the 1st kernel too with 'cgroup_disable=memory'.

    Via the trace and with debugging, it is pointing to commit 97b27821b485
    ("writeback, memcg: Implement foreign dirty flushing") which introduced
    this regression. Disabling memcg causes the null pointer dereference at
    uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath().

    Fix it by returning directly if memcg is disabled, but not trying to
    record the foreign writebacks with dirty pages.

    Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv
    Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
    Signed-off-by: Baoquan He
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
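    A hedged sketch of the fix's shape, not the exact upstream hunk: the
    foreign-dirty tracking bails out before any per-memcg state is touched when
    memcg has been disabled on the command line:

        if (mem_cgroup_disabled())
            return;             /* cgroup_disable=memory: nothing to track */

        /* ... record the foreign writeback as before ... */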
     
  • We get two warnings when building the kernel with W=1:

    mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
    mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]

    Make the functions static to fix this.

    Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
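    The fix pattern, shown with a hypothetical function: a symbol used only
    inside one translation unit is marked static, which documents its scope and
    removes the -Wmissing-prototypes warning under W=1:

        #include <linux/types.h>

        /* before: ssize_t demo_show(char *buf);   (no prototype in any header) */
        static ssize_t demo_show(char *buf)        /* after: static, warning gone */
        {
            buf[0] = '\0';
            return 0;
        }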
     
  • finish_writeback_work() reads @done->waitq after decrementing
    @done->cnt. However, once @done->cnt reaches zero, @done may be freed
    (from stack) at any moment and @done->waitq can contain something
    unrelated by the time finish_writeback_work() tries to read it. This
    led to the following crash.

    "BUG: kernel NULL pointer dereference, address: 0000000000000002"
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted
    ...
    Workqueue: writeback wb_workfn (flush-btrfs-1)
    RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
    Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90
    RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002
    RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0
    Call Trace:
    __wake_up_common_lock+0x63/0xc0
    wb_workfn+0xd2/0x3e0
    process_one_work+0x1f5/0x3f0
    worker_thread+0x2d/0x3d0
    kthread+0x111/0x130
    ret_from_fork+0x1f/0x30

    Fix it by reading and caching @done->waitq before decrementing
    @done->cnt.

    Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com
    Fixes: 5b9cce4c7eb069 ("writeback: Generalize and expose wb_completion")
    Signed-off-by: Tejun Heo
    Debugged-by: Chris Mason
    Reviewed-by: Jens Axboe
    Cc: Jan Kara
    Cc: [5.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
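    A hedged sketch of the fix's shape, assuming the completion layout described
    above: everything needed from @done is read before the decrement, because
    the waiter may free the on-stack structure the instant the count hits zero:

        struct wb_completion *done = work->done;

        if (done) {
            wait_queue_head_t *waitq = done->waitq;     /* read before the decrement */

            /* 'done' may be freed (it lives on the waiter's stack) once cnt hits 0 */
            if (atomic_dec_and_test(&done->cnt))
                wake_up_all(waitq);
        }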
     
  • SECTION_SIZE and SECTION_MASK macros are not getting used anymore. But
    they do conflict with existing definitions on arm64 platform causing
    following warning during build. Lets drop these unused macros.

    mm/memremap.c:16: warning: "SECTION_MASK" redefined
    #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
    arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
    #define SECTION_MASK (~(SECTION_SIZE-1))

    mm/memremap.c:17: warning: "SECTION_SIZE" redefined
    #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
    arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
    #define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT)

    Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reported-by: kbuild test robot
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Logan Gunthorpe
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
    calling CPU in an infinite loop, but with interrupts and preemption
    enabled. From this state, userspace can continue to be scheduled,
    despite the system being "dead" as far as the kernel is concerned.

    This is easily reproducible on arm64 when booting with "nosmp" on the
    command line; a couple of shell scripts print out a periodic "Ping"
    message whilst another triggers a crash by writing to
    /proc/sysrq-trigger:

    | sysrq: Trigger a crash
    | Kernel panic - not syncing: sysrq triggered crash
    | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
    | Hardware name: linux,dummy-virt (DT)
    | Call trace:
    | dump_backtrace+0x0/0x148
    | show_stack+0x14/0x20
    | dump_stack+0xa0/0xc4
    | panic+0x140/0x32c
    | sysrq_handle_reboot+0x0/0x20
    | __handle_sysrq+0x124/0x190
    | write_sysrq_trigger+0x64/0x88
    | proc_reg_write+0x60/0xa8
    | __vfs_write+0x18/0x40
    | vfs_write+0xa4/0x1b8
    | ksys_write+0x64/0xf0
    | __arm64_sys_write+0x14/0x20
    | el0_svc_common.constprop.0+0xb0/0x168
    | el0_svc_handler+0x28/0x78
    | el0_svc+0x8/0xc
    | Kernel Offset: disabled
    | CPU features: 0x0002,24002004
    | Memory Limit: none
    | ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
    | Ping 2!
    | Ping 1!
    | Ping 1!
    | Ping 2!

    The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
    otherwise local interrupts are disabled in 'smp_send_stop()'.

    Disable preemption in 'panic()' before re-enabling interrupts.

    Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
    Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
    Signed-off-by: Will Deacon
    Reported-by: Xogium
    Reviewed-by: Kees Cook
    Cc: Russell King
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Petr Mladek
    Cc: Feng Tang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
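    A hedged guess at the fix's shape (surrounding code elided): preemption is
    disabled up front in panic(), so that when interrupts are later re-enabled
    for the final busy loop, the dead CPU can no longer schedule userspace:

        void panic(const char *fmt, ...)
        {
            /* ... */
            local_irq_disable();
            preempt_disable_notrace();  /* never give this CPU back to the scheduler */
            /* ... console output, then the infinite wait-for-reset loop ... */
        }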
     
  • In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283
    to check whether inode_alloc is NULL:

    if (inode_alloc)

    When inode_alloc is NULL, it is used on line 287:

    ocfs2_inode_lock(inode_alloc, &bh, 0);
    ocfs2_inode_lock_full_nested(inode, ...)
    struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

    Thus, a possible null-pointer dereference may occur.

    To fix this bug, inode_alloc is checked on line 286.

    This bug is found by a static analysis tool STCheck written by us.

    Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia-Ju Bai
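    A hedged sketch of the fix's shape, mirroring the existing check a few lines
    above it: the cluster lock is only taken when the allocator inode actually
    exists:

        if (inode_alloc) {
            status = ocfs2_inode_lock(inode_alloc, &bh, 0);
            if (status < 0) {
                mlog_errno(status);
                goto bail;
            }
        }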
     
  • In ocfs2_write_end_nolock(), there are if statements on lines 1976,
    2047 and 2058, to check whether handle is NULL:

    if (handle)

    When handle is NULL, it is used on line 2045:

    ocfs2_update_inode_fsync_trans(handle, inode, 1);
    oi->i_sync_tid = handle->h_transaction->t_tid;

    Thus, a possible null-pointer dereference may occur.

    To fix this bug, handle is checked before calling
    ocfs2_update_inode_fsync_trans().

    This bug is found by a static analysis tool STCheck written by us.

    Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia-Ju Bai