18 Aug, 2020

2 commits


17 Aug, 2020

3 commits

  • Linus Torvalds
     
  • Pull io_uring fixes from Jens Axboe:
    "A few different things in here.

    Seems like syzbot got some more io_uring bits wired up, and we got a
    handful of reports and the associated fixes are in here.

    General fixes too, and a lot of them marked for stable.

    Lastly, a bit of fallout from the async buffered reads, where we now
    more easily trigger short reads. Some applications don't really like
    that, so the io_read() code now handles short reads internally, and
    got a cleanup along the way so that it's now easier to read (and
    documented). We're now passing tests that failed before"

    * tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
    io_uring: short circuit -EAGAIN for blocking read attempt
    io_uring: sanitize double poll handling
    io_uring: internally retry short reads
    io_uring: retain iov_iter state over io_read/io_write calls
    task_work: only grab task signal lock when needed
    io_uring: enable lookup of links holding inflight files
    io_uring: fail poll arm on queue proc failure
    io_uring: hold 'ctx' reference around task_work queue + execute
    fs: RWF_NOWAIT should imply IOCB_NOIO
    io_uring: defer file table grabbing request cleanup for locked requests
    io_uring: add missing REQ_F_COMP_LOCKED for nested requests
    io_uring: fix recursive completion locking on oveflow flush
    io_uring: use TWA_SIGNAL for task_work uncondtionally
    io_uring: account locked memory before potential error case
    io_uring: set ctx sq/cq entry count earlier
    io_uring: Fix NULL pointer dereference in loop_rw_iter()
    io_uring: add comments on how the async buffered read retry works
    io_uring: io_async_buf_func() need not test page bit

    Linus Torvalds
     
  • Commit 1355c31eeb7e ("asm-generic: pgalloc: provide generic pmd_alloc_one()
    and pmd_free_one()") converted parisc to use generic version of
    pmd_alloc_one() but it missed the fact that parisc uses order-1 pages for
    PMD.

    Restore the original version of pmd_alloc_one() for parisc, just use
    GFP_PGTABLE_KERNEL that implies __GFP_ZERO instead of GFP_KERNEL and
    memset.

    Fixes: 1355c31eeb7e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()")
    Reported-by: Meelis Roos
    Signed-off-by: Mike Rapoport
    Tested-by: Meelis Roos
    Reviewed-by: Matthew Wilcox (Oracle)
    Link: https://lkml.kernel.org/r/9f2b5ebd-e4a4-0fa1-6cd3-4b9f6892d1ad@linux.ee
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

16 Aug, 2020

10 commits

  • Pull block fixes from Jens Axboe:
    "A few fixes on the block side of things:

    - Discard granularity fix (Coly)

    - rnbd cleanups (Guoqing)

    - md error handling fix (Dan)

    - md sysfs fix (Junxiao)

    - Fix flush request accounting, which caused an IO slowdown for some
    configurations (Ming)

    - Properly propagate loop flag for partition scanning (Lennart)"

    * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
    block: fix double account of flush request's driver tag
    loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
    rnbd: no need to set bi_end_io in rnbd_bio_map_kern
    rnbd: remove rnbd_dev_submit_io
    md-cluster: Fix potential error pointer dereference in resize_bitmaps()
    block: check queue's limits.discard_granularity in __blkdev_issue_discard()
    md: get sysfs entry after redundancy attr group create

    Linus Torvalds
     
  • Pull RISC-V fix from Palmer Dabbelt:
    "I collected a single fix during the merge window: we managed to break
    the early trap setup on !MMU, this fixes it"

    * tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
    riscv: Setup exception vector for nommu platform

    Linus Torvalds
     
  • Pull arch/sh updates from Rich Felker:
    "Cleanup, SECCOMP_FILTER support, message printing fixes, and other
    changes to arch/sh"

    * tag 'sh-for-5.9' of git://git.libc.org/linux-sh: (34 commits)
    sh: landisk: Add missing initialization of sh_io_port_base
    sh: bring syscall_set_return_value in line with other architectures
    sh: Add SECCOMP_FILTER
    sh: Rearrange blocks in entry-common.S
    sh: switch to copy_thread_tls()
    sh: use the generic dma coherent remap allocator
    sh: don't allow non-coherent DMA for NOMMU
    dma-mapping: consolidate the NO_DMA definition in kernel/dma/Kconfig
    sh: unexport register_trapped_io and match_trapped_io_handler
    sh: don't include in
    sh: move the ioremap implementation out of line
    sh: move ioremap_fixed details out of
    sh: remove __KERNEL__ ifdefs from non-UAPI headers
    sh: sort the selects for SUPERH alphabetically
    sh: remove -Werror from Makefiles
    sh: Replace HTTP links with HTTPS ones
    arch/sh/configs: remove obsolete CONFIG_SOC_CAMERA*
    sh: stacktrace: Remove stacktrace_ops.stack()
    sh: machvec: Modernize printing of kernel messages
    sh: pci: Modernize printing of kernel messages
    ...

    Linus Torvalds
     
  • One case was missed in the short IO retry handling, and that's hitting
    -EAGAIN on a blocking read attempt (e.g. from io-wq context). This is a
    problem on sockets that are marked as non-blocking when created: they
    don't carry any REQ_F_NOWAIT information to help us terminate them
    instead of perpetually retrying.
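
    The retry-and-terminate behaviour described above can be illustrated from
    userspace. This is a minimal sketch, not the io_uring code: read_full() is
    a hypothetical helper that keeps reading after a short read, but stops
    instead of spinning when a non-blocking descriptor reports EAGAIN.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not kernel code): read up to `len` bytes from `fd`,
 * retrying short reads. Returns the number of bytes read, or -1 with errno
 * set if an error occurred before any data was read.
 */
static ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t ret = read(fd, (char *)buf + done, len - done);

		if (ret > 0) {
			done += (size_t)ret;	/* short read: keep going */
			continue;
		}
		if (ret == 0)
			break;			/* end of file */
		if (errno == EINTR)
			continue;		/* interrupted: retry */
		/*
		 * Any other error, including EAGAIN on a non-blocking
		 * descriptor: retrying would loop forever, so stop here.
		 */
		if (done == 0)
			return -1;
		break;
	}
	return (ssize_t)done;
}
```

    A caller that gets fewer bytes than requested can then decide whether to
    wait for readiness (for example with poll()) before trying again.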

    Fixes: 227c0c9673d8 ("io_uring: internally retry short reads")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • There's a bit of confusion on the matching pairs of poll vs double poll,
    depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
    poll driven retry.

    Add io_poll_get_double() that returns the double poll waitqueue, if any,
    and io_poll_get_single() that returns the original poll waitqueue. With
    that, remove the argument to io_poll_remove_double().

    Finally ensure that wait->private is cleared once the double poll handler
    has run, so that remove knows it's already been seen.

    Cc: stable@vger.kernel.org # v5.8
    Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
    Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull more perf tools updates from Arnaldo Carvalho de Melo:
    "Fixes:
    - Fixes for 'perf bench numa'.

    - Always memset source before memcpy in 'perf bench mem'.

    - Quote CC and CXX for their arguments to fix build in environments
    using those variables to pass more than just the compiler names.

    - Fix module symbol processing, addressing regression detected via
    "perf test".

    - Allow multiple probes in record+script_probe_vfs_getname.sh 'perf
    test' entry.

    Improvements:
    - Add script to autogenerate socket family name id->string table from
    copy of kernel header, used so far in 'perf trace'.

    - 'perf ftrace' improvements to provide similar options for this
    utility so that one can go from 'perf record', 'perf trace', etc to
    'perf ftrace' just by changing the name of the subcommand.

    - Prefer new "sched:sched_waking" trace event when it exists in 'perf
    sched' post processing.

    - Update POWER9 metrics to utilize other metrics.

    - Fall back to querying debuginfod if debuginfo not found locally.

    Miscellaneous:
    - Sync various kvm headers with kernel sources"

    * tag 'perf-tools-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (40 commits)
    perf ftrace: Make option description initials all capital letters
    perf build-ids: Fall back to debuginfod query if debuginfo not found
    perf bench numa: Remove dead code in parse_nodes_opt()
    perf stat: Update POWER9 metrics to utilize other metrics
    perf ftrace: Add change log
    perf: ftrace: Add set_tracing_options() to set all trace options
    perf ftrace: Add option --tid to filter by thread id
    perf ftrace: Add option -D/--delay to delay tracing
    perf: ftrace: Allow set graph depth by '--graph-opts'
    perf ftrace: Add support for trace option tracing_thresh
    perf ftrace: Add option 'verbose' to show more info for graph tracer
    perf ftrace: Add support for tracing option 'irq-info'
    perf ftrace: Add support for trace option funcgraph-irqs
    perf ftrace: Add support for trace option sleep-time
    perf ftrace: Add support for tracing option 'func_stack_trace'
    perf tools: Add general function to parse sublevel options
    perf ftrace: Add option '--inherit' to trace children processes
    perf ftrace: Show trace column header
    perf ftrace: Add option '-m/--buffer-size' to set per-cpu buffer size
    perf ftrace: Factor out function write_tracing_file_int()
    ...

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:
    "Misc fixes and small updates all around the place:

    - Fix mitigation state sysfs output

    - Fix an FPU xstate/xsave code assumption bug triggered by
    Architectural LBR support

    - Fix Lightning Mountain SoC TSC frequency enumeration bug

    - Fix kexec debug output

    - Fix kexec memory range assumption bug

    - Fix a boundary condition in the crash kernel code

    - Optimize purgatory.ro generation a bit

    - Enable ACRN guests to use X2APIC mode

    - Reduce a __text_poke() IRQs-off critical section for the benefit of
    PREEMPT_RT"

    * tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/alternatives: Acquire pte lock with interrupts enabled
    x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
    x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs
    x86/purgatory: Don't generate debug info for purgatory.ro
    x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC
    kexec_file: Correctly output debugging information for the PT_LOAD ELF header
    kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
    x86/crash: Correct the address boundary of function parameters
    x86/acrn: Remove redundant chars from ACRN signature
    x86/acrn: Allow ACRN guest to use X2APIC mode

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Two fixes: fix a new tracepoint's output value, and fix the formatting
    of show-state syslog printouts"

    * tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/debug: Fix the alignment of the show-state debug output
    sched: Fix use of count for nr_running tracepoint

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc fixes, an expansion of perf syscall access to CAP_PERFMON
    privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"

    * tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/rapl: Add support for Intel SPR platform
    perf/x86/rapl: Support multiple RAPL unit quirks
    perf/x86/rapl: Fix missing psys sysfs attributes
    hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
    kprobes: Remove show_registers() function prototype
    perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability

    Linus Torvalds
     
  • Pull locking fixlets from Ingo Molnar:
    "A documentation fix and a 'fallthrough' macro update"

    * tag 'locking-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Convert to use the preferred 'fallthrough' macro
    Documentation/locking/locktypes: Fix a typo

    Linus Torvalds
     

15 Aug, 2020

25 commits

  • Pull 9p updates from Dominique Martinet:

    - some code cleanup

    - a couple of static analysis fixes

    - setattr: try to pick a fid associated with the file rather than the
    dentry, which might sometimes matter

    * tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux:
    9p: Remove unneeded cast from memory allocation
    9p: remove unused code in 9p
    net/9p: Fix sparse endian warning in trans_fd.c
    9p: Fix memory leak in v9fs_mount
    9p: retrieve fid from file when file instance exist.

    Linus Torvalds
     
  • Pull cifs fixes from Steve French:
    "Three small cifs/smb3 fixes, one for stable fixing mkdir path with
    the 'idsfromsid' mount option"

    * tag '5.9-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
    SMB3: Fix mkdir when idsfromsid configured on mount
    cifs: Convert to use the fallthrough macro
    cifs: Fix an error pointer dereference in cifs_mount()

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Stable fixes:
    - pNFS: Don't return layout segments that are being used for I/O
    - pNFS: Don't move layout segments off the active list when being used for I/O

    Features:
    - NFS: Add support for user xattrs through the NFSv4.2 protocol
    - NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
    - NFSv4.0 allow nconnect for v4.0

    Bugfixes and cleanups:
    - nfs: ensure correct writeback errors are returned on close()
    - nfs: nfs_file_write() should check for writeback errors
    - nfs: Fix getxattr kernel panic and memory overflow
    - NFS: Fix the pNFS/flexfiles mirrored read failover code
    - SUNRPC: dont update timeout value on connection reset
    - freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
    - sunrpc: destroy rpc_inode_cachep after unregister_filesystem"

    * tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (32 commits)
    NFS: Fix flexfiles read failover
    fs: nfs: delete repeated words in comments
    rpc_pipefs: convert comma to semicolon
    nfs: Fix getxattr kernel panic and memory overflow
    NFS: Don't return layout segments that are in use
    NFS: Don't move layouts to plh_return_segs list while in use
    NFS: Add layout segment info to pnfs read/write/commit tracepoints
    NFS: Add tracepoints for layouterror and layoutstats.
    NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close()
    SUNRPC dont update timeout value on connection reset
    nfs: nfs_file_write() should check for writeback errors
    nfs: ensure correct writeback errors are returned on close()
    NFSv4.2: xattr cache: get rid of cache discard work queue
    NFS: remove redundant initialization of variable result
    NFSv4.0 allow nconnect for v4.0
    freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
    sunrpc: destroy rpc_inode_cachep after unregister_filesystem
    NFSv4.2: add client side xattr caching.
    NFSv4.2: hook in the user extended attribute handlers
    NFSv4.2: add the extended attribute proc functions.
    ...

    Linus Torvalds
     
  • Pull edac fix from Tony Luck:
    "Fix for the ie31200 driver that missed the first pull"

    * tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
    EDAC/ie31200: Fallback if host bridge device is already initialized

    Linus Torvalds
     
  • Pull devicetree fixes from Rob Herring:
    "Another round of 'allOf' removals and whitespace clean-ups of schemas"

    * tag 'devicetree-fixes-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
    dt-bindings: Remove more cases of 'allOf' containing a '$ref'
    dt-bindings: Whitespace clean-ups in schema files

    Linus Torvalds
     
  • Pull more ACPI updates from Rafael Wysocki:
    "Add new hardware support to the ACPI driver for AMD SoCs, the x86 clk
    driver and the Designware i2c driver (changes from Akshu Agrawal and
    Pu Wen)"

    * tag 'acpi-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    clk: x86: Support RV architecture
    ACPI: APD: Add a fmw property is_raven
    clk: x86: Change name from ST to FCH
    ACPI: APD: Change name from ST to FCH
    i2c: designware: Add device HID for Hygon I2C controller

    Linus Torvalds
     
  • Pull one more power management update from Rafael Wysocki:
    "Modify the intel_pstate driver to allow it to work in the passive mode
    with hardware-managed P-states (HWP) enabled"

    * tag 'pm-5.9-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: intel_pstate: Implement passive mode with HWP enabled

    Linus Torvalds
     
  • Pull MFD updates from Lee Jones:
    "Core Frameworks
    - Make better attempt at matching device with the correct OF node
    - Allow batch removal of hierarchical sub-devices

    New Drivers
    - Add STM32 Clocksource driver
    - Add support for Khadas System Control Microcontroller

    Driver Removal
    - Remove unused driver for TI's SMSC ECE1099

    New Device Support
    - Add support for Intel Emmitsburg PCH to Intel LPSS PCI
    - Add support for Intel Tiger Lake PCH-H to Intel LPSS PCI
    - Add support for Dialog DA revision to Dialog DA9063

    New Functionality
    - Add support for AXP803 to be probed by I2C

    Fix-ups
    - Numerous W=1 warning fixes
    - Device Tree changes (stm32-lptimer, gateworks-gsc, khadas,mcu, stmfx, cros-ec, j721e-system-controller)
    - Enabled Regmap 'fast I/O' in stm32-lptimer
    - Change BUG_ON to WARN_ON in arizona-core
    - Remove superfluous code/initialisation (madera, max14577)
    - Trivial formatting/spelling issues (madera-core, madera-i2c, da9055, max77693-private)
    - Switch to of_platform_populate() in sprd-sc27xx-spi
    - Expand out set/get brightness/pwm macros in lm3533-ctrlbank
    - Disable IRQs on suspend in motorola-cpcap
    - Clean-up error handling in intel_soc_pmic_mrfld
    - Ensure correct removal order of sub-devices in madera
    - Many s/HTTP/HTTPS/ link changes
    - Ensure name used with Regmap is unique in syscon

    Bug Fixes
    - Properly 'put' clock on unbind and error in arizona-core
    - Fix revision handling in da9063
    - Fix 'assignment of read-only location' error in kempld-core
    - Avoid using the Regmap API when atomic in rn5t618
    - Redefine volatile register description in rn5t618
    - Use locking to protect event handler in dln2"

    * tag 'mfd-next-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (76 commits)
    mfd: syscon: Use a unique name with regmap_config
    mfd: Replace HTTP links with HTTPS ones
    mfd: dln2: Run event handler loop under spinlock
    mfd: madera: Improve handling of regulator unbinding
    mfd: mfd-core: Add mechanism for removal of a subset of children
    mfd: intel_soc_pmic_mrfld: Simplify the return expression of intel_scu_ipc_dev_iowrite8()
    mfd: max14577: Remove redundant initialization of variable current_bits
    mfd: rn5t618: Fix caching of battery related registers
    mfd: max77693-private: Drop a duplicated word
    mfd: da9055: pdata.h: Drop a duplicated word
    mfd: rn5t618: Make restart handler atomic safe
    mfd: kempld-core: Fix 'assignment of read-only location' error
    mfd: axp20x: Allow the AXP803 to be probed by I2C
    mfd: da9063: Add support for latest DA silicon revision
    mfd: da9063: Fix revision handling to correctly select reg tables
    dt-bindings: mfd: st,stmfx: Remove I2C unit name
    dt-bindings: mfd: ti,j721e-system-controller.yaml: Add J721e system controller
    mfd: motorola-cpcap: Disable interrupt for suspend
    mfd: smsc-ece1099: Remove driver
    mfd: core: Add OF_MFD_CELL_REG() helper
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:
    "Subsystems affected by this patch series: mm/hotfixes, lz4, exec,
    mailmap, mm/thp, autofs, sysctl, mm/kmemleak, mm/misc and lib"

    * emailed patches from Andrew Morton: (35 commits)
    virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
    ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
    rtl818x: constify ioreadX() iomem argument (as in generic implementation)
    iomap: constify ioreadX() iomem argument (as in generic implementation)
    sh: use generic strncpy()
    sh: clkfwk: remove r8/r16/r32
    include/asm-generic/vmlinux.lds.h: align ro_after_init
    mm: annotate a data race in page_zonenum()
    mm/swap.c: annotate data races for lru_rotate_pvecs
    mm/rmap: annotate a data race at tlb_flush_batched
    mm/mempool: fix a data race in mempool_free()
    mm/list_lru: fix a data race in list_lru_count_one
    mm/memcontrol: fix a data race in scan count
    mm/page_counter: fix various data races at memsw
    mm/swapfile: fix and annotate various data races
    mm/filemap.c: fix a data race in filemap_fault()
    mm/swap_state: mark various intentional data races
    mm/page_io: mark various intentional data races
    mm/frontswap: mark various intentional data races
    mm/kmemleak: silence KCSAN splats in checksum
    ...

    Linus Torvalds
     
  • The ioreadX() helpers have an inconsistent interface. On some architectures
    the void *__iomem address argument is a pointer to const, on others not.

    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Geert Uytterhoeven
    Cc: Allen Hubbe
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jiang
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jakub Kicinski
    Cc: "James E.J. Bottomley"
    Cc: Jason Wang
    Cc: Jon Mason
    Cc: Kalle Valo
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200709072837.5869-5-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • The ioreadX() helpers have an inconsistent interface. On some architectures
    the void *__iomem address argument is a pointer to const, on others not.

    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Geert Uytterhoeven
    Acked-by: Dave Jiang
    Cc: Allen Hubbe
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jakub Kicinski
    Cc: "James E.J. Bottomley"
    Cc: Jason Wang
    Cc: Jon Mason
    Cc: Kalle Valo
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200709072837.5869-4-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • The ioreadX() helpers have an inconsistent interface. On some architectures
    the void *__iomem address argument is a pointer to const, on others not.

    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Geert Uytterhoeven
    Acked-by: Kalle Valo
    Cc: Allen Hubbe
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jiang
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jakub Kicinski
    Cc: "James E.J. Bottomley"
    Cc: Jason Wang
    Cc: Jon Mason
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200709072837.5869-3-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • Patch series "iomap: Constify ioreadX() iomem argument", v3.

    The ioread8/16/32() helpers and others have an inconsistent interface
    among the architectures: some take the address as const, some do not.

    It seems there is nothing really stopping all of them from taking a
    pointer to const.

    This patch (of 4):

    The ioreadX() and ioreadX_rep() helpers have an inconsistent interface. On
    some architectures the void *__iomem address argument is a pointer to
    const, on others not.

    Implementations of ioreadX() do not modify the memory under the address so
    they can be converted to a "const" version for const-safety and
    consistency among architectures.
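
    As a rough userspace illustration of why a const-qualified argument is
    safe for a read accessor, consider the sketch below. demo_ioread8() and
    read_status() are hypothetical names, and ordinary memory stands in for
    an I/O mapping; this is not the kernel's ioread8() implementation.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for an ioreadX()-style accessor. Taking a pointer
 * to const documents that the read does not modify the addressed memory.
 */
static unsigned int demo_ioread8(const volatile void *addr)
{
	return *(const volatile uint8_t *)addr;
}

/*
 * A caller that only holds a const pointer can use the accessor without
 * a cast and without a discarded-qualifier warning.
 */
static unsigned int read_status(const volatile uint8_t *regs)
{
	return demo_ioread8(regs + 1);
}
```

    With a non-const parameter, the call in read_status() would need a cast
    or would trigger -Wdiscarded-qualifiers.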

    [krzk@kernel.org: sh: clk: fix assignment from incompatible pointer type for ioreadX()]
    Link: http://lkml.kernel.org/r/20200723082017.24053-1-krzk@kernel.org
    [akpm@linux-foundation.org: fix drivers/mailbox/bcm-pdc-mailbox.c]
    Link: http://lkml.kernel.org/r/202007132209.Rxmv4QyS%25lkp@intel.com

    Suggested-by: Geert Uytterhoeven
    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Geert Uytterhoeven
    Reviewed-by: Arnd Bergmann
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Kalle Valo
    Cc: "David S. Miller"
    Cc: Jakub Kicinski
    Cc: Dave Jiang
    Cc: Jon Mason
    Cc: Allen Hubbe
    Cc: "Michael S. Tsirkin"
    Cc: Jason Wang
    Link: http://lkml.kernel.org/r/20200709072837.5869-1-krzk@kernel.org
    Link: http://lkml.kernel.org/r/20200709072837.5869-2-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • Current SH gets the warning below at strncpy():

    In file included from ${LINUX}/arch/sh/include/asm/string.h:3,
    from ${LINUX}/include/linux/string.h:20,
    from ${LINUX}/include/linux/bitmap.h:9,
    from ${LINUX}/include/linux/nodemask.h:95,
    from ${LINUX}/include/linux/mmzone.h:17,
    from ${LINUX}/include/linux/gfp.h:6,
    from ${LINUX}/include/linux/slab.h:15,
    from ${LINUX}/linux/drivers/mmc/host/vub300.c:38:
    ${LINUX}/drivers/mmc/host/vub300.c: In function 'new_system_port_status':
    ${LINUX}/arch/sh/include/asm/string_32.h:51:42: warning: array subscript\
    80 is above array bounds of 'char[26]' [-Warray-bounds]
    : "0" (__dest), "1" (__src), "r" (__src+__n)
    ~~~~~^~~~

    In general, strncpy() should behave like below.

    char dest[10];
    char *src = "12345";

    strncpy(dest, src, 10);
    // dest = {'1', '2', '3', '4', '5',
    '\0','\0','\0','\0','\0'}

    But the current SH strncpy() has 2 issues.
    The 1st is that it accesses memory out of bounds (= src + 10).
    The 2nd is that it needs a big fixup, and maintaining __asm__
    code is difficult.

    To solve these issues, this patch simply uses the generic strncpy()
    instead of the architecture-specific one.
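
    The expected strncpy() semantics can be sketched in portable C as
    follows. demo_strncpy() is a hypothetical name used for illustration and
    is not a copy of the kernel's generic implementation; the point is that
    the source is never read past its terminating NUL, and the rest of the
    destination is padded with NUL bytes.

```c
#include <stddef.h>

/* Hypothetical reference version of the strncpy() behaviour described above. */
static char *demo_strncpy(char *dest, const char *src, size_t n)
{
	size_t i = 0;

	/* Copy until n bytes are written or the terminator is reached. */
	while (i < n && src[i] != '\0') {
		dest[i] = src[i];
		i++;
	}
	/* Pad the remainder with NUL; src + i is not advanced any further. */
	while (i < n)
		dest[i++] = '\0';
	return dest;
}
```

    Note that, as with the standard function, the destination is not
    NUL-terminated when src holds n or more characters.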

    Signed-off-by: Kuninori Morimoto
    Signed-off-by: Andrew Morton
    Cc: Alan Modra
    Cc: Bin Meng
    Cc: Chen Zhou
    Cc: Geert Uytterhoeven
    Cc: John Paul Adrian Glaubitz
    Cc: Krzysztof Kozlowski
    Cc: Rich Felker
    Cc: Romain Naour
    Cc: Sam Ravnborg
    Cc: Yoshinori Sato
    Link: https://marc.info/?l=linux-renesas-soc&m=157664657013309
    Signed-off-by: Linus Torvalds

    Kuninori Morimoto
     
  • SH gets the warning below:

    ${LINUX}/drivers/sh/clk/cpg.c: In function 'r8':
    ${LINUX}/drivers/sh/clk/cpg.c:41:17: warning: passing argument 1 of 'ioread8'
    discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
    return ioread8(addr);
    ^~~~
    In file included from ${LINUX}/arch/sh/include/asm/io.h:21,
    from ${LINUX}/include/linux/io.h:13,
    from ${LINUX}/drivers/sh/clk/cpg.c:14:
    ${LINUX}/include/asm-generic/iomap.h:29:29: note: expected 'void *' but
    argument is of type 'const void *'
    extern unsigned int ioread8(void __iomem *);
    ^~~~~~~~~~~~~~

    We don't need "const" for r8/r16/r32. And we don't need r8/r16/r32
    themselves. This patch cleans these up.

    Signed-off-by: Kuninori Morimoto
    Signed-off-by: Andrew Morton
    Cc: Alan Modra
    Cc: Bin Meng
    Cc: Chen Zhou
    Cc: Geert Uytterhoeven
    Cc: John Paul Adrian Glaubitz
    Cc: Krzysztof Kozlowski
    Cc: Rich Felker
    Cc: Romain Naour
    Cc: Sam Ravnborg
    Cc: Yoshinori Sato
    X-MARC-Message: https://marc.info/?l=linux-renesas-soc&m=157852973916903
    Signed-off-by: Linus Torvalds

    Kuninori Morimoto
     
  • Since the patch [1], building the kernel using a toolchain built with
    binutils 2.33.1 prevents booting a sh4 system under Qemu. Apply the patch
    provided by Alan Modra [2] that fixes alignment of rodata.

    [1] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=ebd2263ba9a9124d93bbc0ece63d7e0fae89b40e
    [2] https://www.sourceware.org/ml/binutils/2019-12/msg00112.html

    Signed-off-by: Romain Naour
    Signed-off-by: Andrew Morton
    Cc: Alan Modra
    Cc: Bin Meng
    Cc: Chen Zhou
    Cc: Geert Uytterhoeven
    Cc: John Paul Adrian Glaubitz
    Cc: Krzysztof Kozlowski
    Cc: Kuninori Morimoto
    Cc: Rich Felker
    Cc: Sam Ravnborg
    Cc: Yoshinori Sato
    Cc: Arnd Bergmann
    Cc:
    Link: https://marc.info/?l=linux-sh&m=158429470221261
    Signed-off-by: Linus Torvalds

    Romain Naour
     
  • BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page

    write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
    page_cpupid_xchg_last+0x51/0x80
    page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
    wp_page_reuse+0x3e/0xc0
    wp_page_reuse at mm/memory.c:2453
    do_wp_page+0x472/0x7b0
    do_wp_page at mm/memory.c:2798
    __handle_mm_fault+0xcb0/0xd00
    handle_pte_fault at mm/memory.c:4049
    (inlined by) __handle_mm_fault at mm/memory.c:4163
    handle_mm_fault+0xfc/0x2f0
    handle_mm_fault at mm/memory.c:4200
    do_page_fault+0x263/0x6f9
    do_user_addr_fault at arch/x86/mm/fault.c:1465
    (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
    page_fault+0x34/0x40

    read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
    put_page+0x15a/0x1f0
    page_zonenum at include/linux/mm.h:923
    (inlined by) is_zone_device_page at include/linux/mm.h:929
    (inlined by) page_is_devmap_managed at include/linux/mm.h:948
    (inlined by) put_page at include/linux/mm.h:1023
    wp_page_copy+0x571/0x930
    wp_page_copy at mm/memory.c:2615
    do_wp_page+0x107/0x7b0
    __handle_mm_fault+0xcb0/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    A page never changes its zone number. The zone number happens to be
    stored in the same word as other bits which are modified, but the zone
    number bits will never be modified by any other write, so it can accept
    a reload of the zone bits after an intervening write and it doesn't need
    to use READ_ONCE(). Thus, annotate this data race using
    ASSERT_EXCLUSIVE_BITS() to also assert that there are no concurrent
    writes to it.
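
    The argument above can be pictured with a small userspace sketch. The
    bit layout, the DEMO_* constants and both helpers are assumptions made
    for illustration and do not match the kernel's real page->flags layout:
    as long as every writer preserves the zone bits, a reader that extracts
    only those bits sees the same value before and after the write.

```c
#include <stdint.h>

/* Hypothetical layout: the top 2 bits of the word hold the zone number. */
#define DEMO_ZONE_SHIFT 62
#define DEMO_ZONE_MASK  UINT64_C(3)

/* Extract the zone number; only the zone bits influence the result. */
static unsigned int demo_zonenum(uint64_t flags)
{
	return (unsigned int)((flags >> DEMO_ZONE_SHIFT) & DEMO_ZONE_MASK);
}

/* Replace every non-zone bit while leaving the zone bits untouched. */
static uint64_t demo_set_other_bits(uint64_t flags, uint64_t bits)
{
	uint64_t zone_bits = DEMO_ZONE_MASK << DEMO_ZONE_SHIFT;

	return (flags & zone_bits) | (bits & ~zone_bits);
}
```

    In the kernel, ASSERT_EXCLUSIVE_BITS() lets KCSAN check the same
    invariant at run time, namely that no concurrent writer changes the
    masked bits.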

    Suggested-by: Marco Elver
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Paul E. McKenney
    Cc: David Hildenbrand
    Cc: Jan Kara
    Cc: John Hubbard
    Cc: Ira Weiny
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/1581619089-14472-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • A read of lru_add_pvec->nr could be interrupted by code that then writes
    to the same variable. The write has local interrupts disabled, but the
    plain reads result in data races. However, it is unlikely the compilers
    could do much damage here given that lru_add_pvec->nr is an "unsigned
    char" and there is an existing compiler barrier. Thus, annotate the reads
    using the data_race() macro. The data races were reported by KCSAN:

    BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page

    write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
    rotate_reclaimable_page+0x2df/0x490
    pagevec_add at include/linux/pagevec.h:81
    (inlined by) rotate_reclaimable_page at mm/swap.c:259
    end_page_writeback+0x1b5/0x2b0
    end_swap_bio_write+0x1d0/0x280
    bio_endio+0x297/0x560
    dec_pending+0x218/0x430 [dm_mod]
    clone_endio+0xe4/0x2c0 [dm_mod]
    bio_endio+0x297/0x560
    blk_update_request+0x201/0x920
    scsi_end_request+0x6b/0x4a0
    scsi_io_completion+0xb7/0x7e0
    scsi_finish_command+0x1ed/0x2a0
    scsi_softirq_done+0x1c9/0x1d0
    blk_done_softirq+0x181/0x1d0
    __do_softirq+0xd9/0x57c
    irq_exit+0xa2/0xc0
    do_IRQ+0x8b/0x190
    ret_from_intr+0x0/0x42
    delay_tsc+0x46/0x80
    __const_udelay+0x3c/0x40
    __udelay+0x10/0x20
    kcsan_setup_watchpoint+0x202/0x3a0
    __tsan_read1+0xc2/0x100
    lru_add_drain_cpu+0xb8/0x3f0
    lru_add_drain+0x25/0x40
    shrink_active_list+0xe1/0xc80
    shrink_lruvec+0x766/0xb70
    shrink_node+0x2d6/0xca0
    do_try_to_free_pages+0x1f7/0x9a0
    try_to_free_pages+0x252/0x5b0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
    lru_add_drain_cpu+0xb8/0x3f0
    lru_add_drain_cpu at mm/swap.c:602
    lru_add_drain+0x25/0x40
    shrink_active_list+0xe1/0xc80
    shrink_lruvec+0x766/0xb70
    shrink_node+0x2d6/0xca0
    do_try_to_free_pages+0x1f7/0x9a0
    try_to_free_pages+0x252/0x5b0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    2 locks held by oom02/37761:
    #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
    #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
    irq event stamp: 1949217
    trace_hardirqs_on_thunk+0x1a/0x1c
    __do_softirq+0x2e7/0x57c
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
    Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Marco Elver
    Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • mm->tlb_flush_batched could be accessed concurrently as noticed by
    KCSAN:

    BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

    write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
    try_to_unmap_one+0x59a/0x1ab0
    set_tlb_ubc_flush_pending at mm/rmap.c:635
    (inlined by) try_to_unmap_one at mm/rmap.c:1538
    rmap_walk_anon+0x296/0x650
    rmap_walk+0xdf/0x100
    try_to_unmap+0x18a/0x2f0
    shrink_page_list+0xef6/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
    flush_tlb_batched_pending+0x29/0x90
    flush_tlb_batched_pending at mm/rmap.c:682
    change_p4d_range+0x5dd/0x1030
    change_pte_range at mm/mprotect.c:44
    (inlined by) change_pmd_range at mm/mprotect.c:212
    (inlined by) change_pud_range at mm/mprotect.c:240
    (inlined by) change_p4d_range at mm/mprotect.c:260
    change_protection+0x222/0x310
    change_prot_numa+0x3e/0x60
    task_numa_work+0x219/0x350
    task_work_run+0xed/0x140
    prepare_exit_to_usermode+0x2cc/0x2e0
    ret_from_intr+0x32/0x42

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    flush_tlb_batched_pending() runs under the PTL but the write does not.
    However, mm->tlb_flush_batched is only a bool, so the value is unlikely
    to be shattered. Thus, mark it as an intentional data race by using the
    data_race() macro.
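
    A minimal userspace sketch of the annotation follows. The message does
    not show which access the patch marks, so the sketch assumes it is the
    read in the flush path. data_race() is defined here as a pass-through
    stand-in, and struct mm_sketch and both helpers are hypothetical names.

```c
#include <stdbool.h>

/*
 * Stand-in for the kernel's data_race(): it evaluates the expression as-is.
 * In a KCSAN build the real macro additionally tells the sanitizer that a
 * race on this access is intentional.
 */
#define data_race(expr) (expr)

struct mm_sketch {
	bool tlb_flush_batched;
};

/* Writer: in the kernel this store is not done under the page table lock. */
static void set_flush_pending(struct mm_sketch *mm)
{
	mm->tlb_flush_batched = true;
}

/* Reader: returns true if a batched flush was pending and is now handled. */
static bool flush_batched_pending(struct mm_sketch *mm)
{
	if (data_race(mm->tlb_flush_batched)) {
		/* A real implementation would flush the TLB here. */
		mm->tlb_flush_batched = false;
		return true;
	}
	return false;
}
```

    The macro changes nothing about the generated code in this sketch; its
    value is as documentation for readers and for the sanitizer.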

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • mempool_t pool.curr_nr could be accessed concurrently as noticed by
    KCSAN:

    BUG: KCSAN: data-race in mempool_free / remove_element

    write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
    remove_element+0x4a/0x1c0
    remove_element at mm/mempool.c:132
    mempool_alloc+0x102/0x210
    (inlined by) mempool_alloc at mm/mempool.c:399
    bio_alloc_bioset+0x106/0x2c0
    get_swap_bio+0x49/0x230
    __swap_writepage+0x680/0xc30
    swap_writepage+0x9c/0xf0
    pageout+0x33e/0xae0
    shrink_page_list+0x1f57/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
    mempool_free+0x3e/0x150
    mempool_free at mm/mempool.c:492
    bio_free+0x192/0x280
    bio_put+0x91/0xd0
    end_swap_bio_write+0x1d8/0x280
    bio_endio+0x2c2/0x5b0
    dec_pending+0x22b/0x440 [dm_mod]
    clone_endio+0xe4/0x2c0 [dm_mod]
    bio_endio+0x2c2/0x5b0
    blk_update_request+0x217/0x940
    scsi_end_request+0x6b/0x4d0
    scsi_io_completion+0xb7/0x7e0
    scsi_finish_command+0x223/0x310
    scsi_softirq_done+0x1d5/0x210
    blk_mq_complete_request+0x224/0x250
    scsi_mq_done+0xc2/0x250
    pqi_raid_io_complete+0x5a/0x70 [smartpqi]
    pqi_irq_handler+0x150/0x1410 [smartpqi]
    __handle_irq_event_percpu+0x90/0x540
    handle_irq_event_percpu+0x49/0xd0
    handle_irq_event+0x85/0xca
    handle_edge_irq+0x13f/0x3e0
    do_IRQ+0x86/0x190

    The write is under pool->lock but the read is lockless. Even though
    commit 5b990546e334 ("mempool: fix and document synchronization and
    memory barrier usage") introduced the smp_wmb() and smp_rmb() pair to
    improve the situation, that is inadequate to protect it from data races
    which could lead to a logic bug, so fix it by adding READ_ONCE() for the
    read.
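
    The pattern can be sketched in userspace as follows, with a pthread
    mutex standing in for pool->lock. READ_ONCE() is approximated by a
    volatile load (this relies on the GCC/Clang __typeof__ extension), and
    struct pool_sketch, pool_take() and pool_give_back() are hypothetical
    names, not the mempool API.

```c
#include <pthread.h>

/* Userspace approximation of the kernel's READ_ONCE(): one volatile load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct pool_sketch {
	pthread_mutex_t lock;
	int curr_nr; /* written only with lock held */
	int min_nr;
};

/* Writer side: take one element while holding the lock. */
static int pool_take(struct pool_sketch *pool)
{
	int taken = 0;

	pthread_mutex_lock(&pool->lock);
	if (pool->curr_nr > 0) {
		pool->curr_nr--;
		taken = 1;
	}
	pthread_mutex_unlock(&pool->lock);
	return taken;
}

/*
 * Reader side: a lockless check decides whether the lock is worth taking.
 * READ_ONCE() keeps the compiler from tearing or re-reading the value; the
 * condition is rechecked under the lock before the counter is modified.
 */
static int pool_give_back(struct pool_sketch *pool)
{
	int refilled = 0;

	if (READ_ONCE(pool->curr_nr) < pool->min_nr) {
		pthread_mutex_lock(&pool->lock);
		if (pool->curr_nr < pool->min_nr) {
			pool->curr_nr++;
			refilled = 1;
		}
		pthread_mutex_unlock(&pool->lock);
	}
	return refilled;
}
```

    The lockless read is only a hint: a stale value costs at most one
    unnecessary lock acquisition or one skipped refill, and the locked
    recheck keeps the counter itself consistent.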

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • struct list_lru_one l.nr_items could be accessed concurrently as noticed
    by KCSAN:

    BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

    write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
    list_lru_isolate_move+0xf9/0x130
    list_lru_isolate_move at mm/list_lru.c:180
    inode_lru_isolate+0x12b/0x2a0
    __list_lru_walk_one+0x122/0x3d0
    list_lru_walk_one+0x75/0xa0
    prune_icache_sb+0x8b/0xc0
    super_cache_scan+0x1b8/0x250
    do_shrink_slab+0x256/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
    list_lru_count_one+0x116/0x2f0
    list_lru_count_one at mm/list_lru.c:193
    super_cache_count+0xe8/0x170
    do_shrink_slab+0x95/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #4
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    A shattered l.nr_items could affect the shrinker behaviour due to a data
    race. Fix it by adding READ_ONCE() for the read. Since the writes are
    aligned and up to word-size, assume those are safe from data races to
    avoid readability issues of writing WRITE_ONCE(var, var + val).

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
    accessed concurrently as noticed by KCSAN:

    BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size

    write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
    mem_cgroup_update_lru_size+0x11c/0x1d0
    mem_cgroup_update_lru_size at mm/memcontrol.c:1266
    isolate_lru_pages+0x6a9/0xf30
    shrink_active_list+0x123/0xcc0
    shrink_lruvec+0x8fd/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
    lruvec_lru_size+0xbb/0x270
    mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
    (inlined by) lruvec_lru_size at mm/vmscan.c:326
    shrink_lruvec+0x1d0/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_current+0xa6/0x120
    alloc_slab_page+0x3b1/0x540
    allocate_slab+0x70/0x660
    new_slab+0x46/0x70
    ___slab_alloc+0x4ad/0x7d0
    __slab_alloc+0x43/0x70
    kmem_cache_alloc+0x2c3/0x420
    getname_flags+0x4c/0x230
    getname+0x22/0x30
    do_sys_openat2+0x205/0x3b0
    do_sys_open+0x9a/0xf0
    __x64_sys_openat+0x62/0x80
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 95 PID: 50964 Comm: cc1 Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    The write is under lru_lock, but the read is lockless. The scan
    count is used to determine how aggressively the anon and file LRU lists
    should be scanned. Load tearing could generate an inefficient heuristic,
    so fix it by adding READ_ONCE() for the read.

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
    memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
    concurrently, as reported by KCSAN:

    BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

    read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
    page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
    try_charge+0x131/0xd50 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x58/0x140
    __memcg_kmem_charge+0xcc/0x280
    __alloc_pages_nodemask+0x1e1/0x450
    alloc_pages_current+0xa6/0x120
    pte_alloc_one+0x17/0xd0
    __pte_alloc+0x3a/0x1f0
    copy_p4d_range+0xc36/0x1990
    copy_page_range+0x21d/0x360
    dup_mmap+0x5f5/0x7a0
    dup_mm+0xa2/0x240
    copy_process+0x1b3f/0x3460
    _do_fork+0xaa/0xa20
    __x64_sys_clone+0x13b/0x170
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
    page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
    try_charge+0x131/0xd50 mm/memcontrol.c:2405
    mem_cgroup_try_charge+0x159/0x460
    mem_cgroup_try_charge_delay+0x3d/0xa0
    wp_page_copy+0x14d/0x930
    do_wp_page+0x107/0x7b0
    __handle_mm_fault+0xce6/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

    write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
    page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
    try_charge+0x185/0xbf0 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
    __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
    __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

    read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
    page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
    try_charge+0x185/0xbf0 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
    __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
    __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

    Since watermark could be compared or set to garbage due to a data race
    which would change the code logic, fix it by adding a pair of READ_ONCE()
    and WRITE_ONCE() in those places.

    The "failcnt" counter is tolerant of some degree of inaccuracy and is only
    used to report stats, so a data race will not be harmful. Thus, mark it as
    an intentional data race using the data_race() macro.
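
    The two treatments can be put side by side in a small userspace sketch.
    It uses C11 atomics for the usage counter and volatile-based stand-ins
    for READ_ONCE(), WRITE_ONCE() and data_race(); struct counter_sketch and
    counter_try_charge() are hypothetical and simplified (no parent
    hierarchy), not the page_counter code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace approximations of the kernel macros (GCC/Clang __typeof__). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define data_race(expr)    (expr)

struct counter_sketch {
	atomic_long usage;
	long max;
	long watermark; /* highest usage seen so far */
	long failcnt;   /* statistics only */
};

static bool counter_try_charge(struct counter_sketch *c, long nr_pages)
{
	long new = atomic_fetch_add(&c->usage, nr_pages) + nr_pages;

	if (new > c->max) {
		atomic_fetch_sub(&c->usage, nr_pages);
		/* Inaccuracy is tolerated here, so only annotate the race. */
		data_race(c->failcnt++);
		return false;
	}

	/*
	 * The watermark feeds a comparison, so read and write it with single
	 * untorn accesses. A racing update can still be lost, which the
	 * marked accesses do not prevent.
	 */
	if (new > READ_ONCE(c->watermark))
		WRITE_ONCE(c->watermark, new);
	return true;
}
```

    A failed charge bumps failcnt and leaves both usage and watermark as
    they were; a successful one raises the watermark only when the new usage
    exceeds it.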

    Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
    Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Tetsuo Handa
    Cc: Marco Elver
    Cc: Dmitry Vyukov
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
    each be accessed concurrently, as noticed by KCSAN:

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], a data race could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
    except for those isolated reads that only compare against zero, where a
    data race would cause no harm. Annotate those as intentional data races
    using the data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a
    data race there would cause no issue.

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • struct file_ra_state ra.mmap_miss could be accessed concurrently during
    page faults as noticed by KCSAN:

    BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

    write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
    filemap_fault+0x920/0xfc0
    do_sync_mmap_readahead at mm/filemap.c:2384
    (inlined by) filemap_fault at mm/filemap.c:2486
    __xfs_filemap_fault+0x112/0x3e0 [xfs]
    xfs_filemap_fault+0x74/0x90 [xfs]
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
    filemap_map_pages+0xc2e/0xd80
    filemap_map_pages at mm/filemap.c:2625
    do_fault+0x3da/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    ra.mmap_miss contributes to readahead decisions, so a data race could be
    undesirable. Both the read and the write are only under the
    non-exclusive mmap_sem, so two concurrent writers could even underflow
    the counter. Fix the underflow by working on a local variable before
    committing a final store to ra.mmap_miss, given that a small inaccuracy
    of the counter should be acceptable.
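
    The local-variable approach might look like the userspace sketch below.
    With an in-place "if (counter) counter--", two concurrent callers can
    both observe 1 and both decrement, wrapping the unsigned counter. Here
    each caller works on its own snapshot and stores a value that cannot
    have gone below zero. MISS_CAP, struct ra_sketch and both helpers are
    assumptions for the example, and READ_ONCE()/WRITE_ONCE() are
    volatile-based stand-ins.

```c
/* Userspace approximations of the kernel macros (GCC/Clang __typeof__). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define MISS_CAP 1000u /* hypothetical upper bound for the counter */

struct ra_sketch {
	unsigned int mmap_miss;
};

/* Miss path: snapshot, bump the local copy, then commit with one store. */
static void record_miss(struct ra_sketch *ra)
{
	unsigned int miss = READ_ONCE(ra->mmap_miss);

	if (miss < MISS_CAP)
		WRITE_ONCE(ra->mmap_miss, miss + 1);
}

/* Hit path: subtract on the local copy, clamping at zero, then commit. */
static void record_hits(struct ra_sketch *ra, unsigned int nr_hits)
{
	unsigned int miss = READ_ONCE(ra->mmap_miss);

	miss = miss > nr_hits ? miss - nr_hits : 0;
	WRITE_ONCE(ra->mmap_miss, miss);
}
```

    Concurrent callers can still overwrite each other's updates, so the
    counter may be slightly off, but it can no longer wrap around.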

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Qian Cai
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov