06 Jun, 2020

1 commit

  • Pull READ_IMPLIES_EXEC changes from Borislav Petkov:
    "Split the old READ_IMPLIES_EXEC workaround from executable
    PT_GNU_STACK now that toolchains long support PT_GNU_STACK marking and
    there's no need anymore to force modern programs into having all their
    user mappings executable instead of only the stack and the PROT_EXEC
    ones.

    Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and
    arm64.

    Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm
    and arm64.

    By Kees Cook"

    * tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arm64/elf: Disable automatic READ_IMPLIES_EXEC for 64-bit address spaces
    arm32/64/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
    arm32/64/elf: Add tables to document READ_IMPLIES_EXEC
    x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit
    x86/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
    x86/elf: Add table to document READ_IMPLIES_EXEC

    Linus Torvalds
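
The READ_IMPLIES_EXEC flag discussed in the pull above is part of a process's personality, so it can be inspected from userspace. The following is a minimal sketch, assuming a glibc system where <sys/personality.h> provides personality() and the READ_IMPLIES_EXEC constant; the helper name query_read_implies_exec() is hypothetical and not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/personality.h>

/*
 * Hypothetical helper: report whether READ_IMPLIES_EXEC is set for the
 * calling process. Returns 0 and stores 0 or 1 in *enabled on success,
 * or returns -1 if the personality could not be queried.
 */
static int query_read_implies_exec(int *enabled)
{
    int persona;

    assert(enabled != NULL);

    /* Passing 0xffffffff queries the current persona without changing it. */
    persona = personality(0xffffffff);
    if (persona == -1)
        return -1;

    *enabled = (persona & READ_IMPLIES_EXEC) != 0;
    return 0;
}
```

Calling query_read_implies_exec(&flag) from a test binary built with and without an executable-stack marking is one way to compare the behaviour described in the tables this series adds; the exact result depends on architecture and kernel version.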
     

05 Jun, 2020

13 commits

  • Merge yet more updates from Andrew Morton:

    - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
    __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

    - Various other little subsystems

    * emailed patches from Andrew Morton: (127 commits)
    lib/ubsan.c: fix gcc-10 warnings
    tools/testing/selftests/vm: remove duplicate headers
    selftests: vm: pkeys: fix multilib builds for x86
    selftests: vm: pkeys: use the correct page size on powerpc
    selftests/vm/pkeys: override access right definitions on powerpc
    selftests/vm/pkeys: test correct behaviour of pkey-0
    selftests/vm/pkeys: introduce a sub-page allocator
    selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
    selftests/vm/pkeys: associate key on a mapped page and detect write violation
    selftests/vm/pkeys: associate key on a mapped page and detect access violation
    selftests/vm/pkeys: improve checks to determine pkey support
    selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
    selftests/vm/pkeys: fix number of reserved powerpc pkeys
    selftests/vm/pkeys: introduce powerpc support
    selftests/vm/pkeys: introduce generic pkey abstractions
    selftests: vm: pkeys: use the correct huge page size
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
    selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
    selftests/vm/pkeys: fix pkey_disable_clear()
    selftests: vm: pkeys: add helpers for pkey bits
    ...

    Linus Torvalds
     
  • Currently copy_string_kernel is just a wrapper around copy_strings that
    simplifies the calling conventions and uses set_fs to allow passing a
    kernel pointer. But due to the fact that we only need to handle a single
    kernel argument pointer, the logic can be significantly simplified while
    getting rid of the set_fs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • copy_strings_kernel is always used with a single argument,
    adjust the calling convention to that.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Use a more common logging style.

    Add and use pr_fmt, coalesce the format string, align arguments,
    use better grammar.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Cc: Vasily Averin
    Link: http://lkml.kernel.org/r/96ff603230ca1bd60034c36519be3930c3a3a226.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
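
The pr_fmt convention mentioned above works by routing every format string through one macro, so a prefix is defined once instead of being repeated in each message. The sketch below models this in userspace with snprintf(); the "demo: " prefix, the demo_pr_info() macro, and log_inode_error() are hypothetical stand-ins, not the code touched by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical prefix; kernel code commonly uses KBUILD_MODNAME ": " here. */
#define pr_fmt(fmt) "demo: " fmt

/* Userspace stand-in for pr_info(): the format always passes through pr_fmt(). */
#define demo_pr_info(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* Format one message into buf and return the length snprintf() reports. */
static int log_inode_error(char *buf, size_t size, unsigned long ino, int err)
{
    assert(buf != NULL && size > 0);

    /* One coalesced format string, with the arguments aligned below it. */
    return demo_pr_info(buf, size, "inode %lu: error %d\n",
                        ino, err);
}
```

The ##__VA_ARGS__ form is a GNU extension, which is also what the kernel's own printk helpers rely on.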
     
  • The current readahead for FAT entries is very simple and has some flaws,
    so it does not work well in some environments. This patch improves the
    readahead somewhat.

    The key points of modification are,

    - make the readahead size tunable by using bdi->ra_pages
    - respect bdi->io_pages to avoid small I/O requests
    - update readahead window before fully exhausting

    With this patch, on slow USB connected 2TB hdd:

    [before]
    383.18sec

    [after]
    51.03sec

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Tested-by: hyeongseok.kim
    Reviewed-by: hyeongseok.kim
    Link: http://lkml.kernel.org/r/87d08e1dlh.fsf@mail.parknet.co.jp
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • If FAT length == 0, the image doesn't have any data, and the root dir
    and FAT entries can end up overlapping.

    Windows also treats this as an invalid format.

    Reported-by: syzbot+6f1624f937d9d6911e2d@syzkaller.appspotmail.com
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Dmitry Vyukov
    Link: http://lkml.kernel.org/r/87r1wz8mrd.fsf@mail.parknet.co.jp
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
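
The FAT length check described above can be illustrated in userspace. This is a sketch under stated assumptions, not the kernel patch: it reads the two sectors-per-FAT fields at their standard BPB offsets (22 for the 16-bit field, 36 for the FAT32 field) from a raw boot sector and rejects the image when both are zero. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard BPB offsets of the sectors-per-FAT fields in the boot sector. */
#define BPB_FAT16_LENGTH_OFF 22  /* 16-bit field, used by FAT12/FAT16 */
#define BPB_FAT32_LENGTH_OFF 36  /* 32-bit field, used by FAT32 */

/* On-disk FAT fields are little-endian; decode them byte by byte. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return 1 if the boot sector declares a non-zero FAT length in either
 * field, or 0 if both are zero and the image should be rejected.
 * boot_sector must point to at least 40 readable bytes.
 */
static int fat_length_is_valid(const uint8_t *boot_sector)
{
    assert(boot_sector != NULL);

    return read_le16(boot_sector + BPB_FAT16_LENGTH_OFF) != 0 ||
           read_le32(boot_sector + BPB_FAT32_LENGTH_OFF) != 0;
}
```

A mount-time check like this only filters out one class of malformed image; it does not replace the other boot sector sanity checks.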
     
  • The ifndef was added a long time ago to support archs that would define
    their own mapping function. The last user was the metag arch which was
    removed from the tree, and as such there are no users left. Let's kill
    it.

    Signed-off-by: Anthony Iliopoulos
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200402161543.4119-1-ailiop@suse.com
    Signed-off-by: Linus Torvalds

    Anthony Iliopoulos
     
  • "catch" is a reserved keyword in C++, rename it to something both gcc and
    g++ accept.

    Rename "ign" for symmetry.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200331210905.GA31680@avx2
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Pull execve updates from Eric Biederman:
    "Last cycle for the Nth time I ran into bugs and quality of
    implementation issues related to exec that could not easily be
    fixed because of the way exec is implemented. So I have been digging
    into exec and cleaning up what I can.

    I don't think I have exec sorted out enough to fix the issues I
    started with but I have made some headway this cycle with 4 sets of
    changes.

    - promised cleanups after introducing exec_update_mutex

    - trivial cleanups for exec

    - control flow simplifications

    - remove the recomputation of bprm->cred

    The net result is code that is a bit easier to understand and work
    with and a decrease in the number of lines of code (if you don't count
    the added tests)"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
    exec: Compute file based creds only once
    exec: Add a per bprm->file version of per_clear
    binfmt_elf_fdpic: fix execfd build regression
    selftests/exec: Add binfmt_script regression test
    exec: Remove recursion from search_binary_handler
    exec: Generic execfd support
    exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
    exec: Move the call of prepare_binprm into search_binary_handler
    exec: Allow load_misc_binary to call prepare_binprm unconditionally
    exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
    exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
    exec: Teach prepare_exec_creds how exec treats uids & gids
    exec: Set the point of no return sooner
    exec: Move handling of the point of no return to the top level
    exec: Run sync_mm_rss before taking exec_update_mutex
    exec: Fix spelling of search_binary_handler in a comment
    exec: Move the comment from above de_thread to above unshare_sighand
    exec: Rename flush_old_exec begin_new_exec
    exec: Move most of setup_new_exec into flush_old_exec
    exec: In setup_new_exec cache current in the local variable me
    ...

    Linus Torvalds
     
  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly once

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     
  • Pull ext2 and reiserfs cleanups from Jan Kara:
    "Two small cleanups for ext2 and one for reiserfs"

    * tag 'for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    reiserfs: Replace kmalloc with kcalloc in the comment
    ext2: code cleanup by removing ifdef macro surrounding
    ext2: Fix i_op setting for special inode

    Linus Torvalds
     
  • Pull fsnotify updates from Jan Kara:
    "Several smaller fixes and cleanups for fsnotify subsystem"

    * tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fanotify: fix ignore mask logic for events on child and on dir
    fanotify: don't write with size under sizeof(response)
    fsnotify: Remove proc_fs.h include
    fanotify: remove reference to fill_event_metadata()
    fsnotify: add mutex destroy
    fanotify: prefix should_merge()
    fanotify: Replace zero-length array with flexible-array
    inotify: Fix error return code assignment flow.
    fsnotify: Add missing annotation for fsnotify_finish_user_wait() and for fsnotify_prepare_user_wait()

    Linus Torvalds
     
  • Pull zonefs update from Damien Le Moal:
    "Only one patch in this pull request to cleanup handling of uuid using
    the import_uuid() helper, from Andy"

    * tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
    zonefs: Replace uuid_copy() with import_uuid()

    Linus Torvalds
     

04 Jun, 2020

6 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
  • In a 32-bit program running on the arm64 architecture, when the address
    space below mmap base is completely exhausted, shmat() for huge pages will
    return ENOMEM, but shmat() for normal pages can still succeed in no-legacy
    mode. This does not seem fair.

    For normal pages, the calling trace of get_unmapped_area() is:

    => mm->get_unmapped_area()
    if on legacy mode,
    => arch_get_unmapped_area()
    => vm_unmapped_area()
    if on no-legacy mode,
    => arch_get_unmapped_area_topdown()
    => vm_unmapped_area()

    For huge pages, the calling trace of get_unmapped_area() is:

    => file->f_op->get_unmapped_area()
    => hugetlb_get_unmapped_area()
    => vm_unmapped_area()

    To solve this issue, we only need to make hugetlb_get_unmapped_area() take
    the same path as mm->get_unmapped_area(). Add *bottomup() and *topdown()
    for hugetlbfs, and check current mm->get_unmapped_area() to decide which
    one to use. If mm->get_unmapped_area is equal to
    arch_get_unmapped_area_topdown(), hugetlb_get_unmapped_area() calls the
    topdown routine; otherwise it calls the bottomup routine.

    Reported-by: kbuild test robot
    Signed-off-by: Shijie Hu
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Will Deacon
    Cc: Xiaoming Ni
    Cc: Kefeng Wang
    Cc: yangerkun
    Cc: ChenGang
    Cc: Chen Jie
    Link: http://lkml.kernel.org/r/20200518065338.113664-1-hushijie3@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Hu
     
  • They're the same function, and for the purpose of all callers they are
    equivalent to lru_cache_add().

    [akpm@linux-foundation.org: fix it for local_lock changes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     
  • Pull splice updates from Al Viro:
    "Christoph's assorted splice cleanups"

    * 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: rename pipe_buf ->steal to ->try_steal
    fs: make the pipe_buf_operations ->confirm operation optional
    fs: make the pipe_buf_operations ->steal operation optional
    trace: remove tracing_pipe_buf_ops
    pipe: merge anon_pipe_buf*_ops
    fs: simplify do_splice_from
    fs: simplify do_splice_to

    Linus Torvalds
     
  • Pull thread updates from Christian Brauner:
    "We have been discussing using pidfds to attach to namespaces for quite
    a while and the patches have in one form or another already existed
    for about a year. But I wanted to wait to see how the general api
    would be received and adopted.

    This contains the changes to make it possible to use pidfds to attach
    to the namespaces of a process, i.e. they can be passed as the first
    argument to the setns() syscall.

    When only a single namespace type is specified the semantics are
    equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
    equals setns(pidfd, CLONE_NEWNET).

    However, when a pidfd is passed, multiple namespace flags can be
    specified in the second setns() argument and setns() will attach the
    caller to all the specified namespaces all at once or to none of them.

    Specifying 0 is not valid together with a pidfd. Here are just two
    obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

    Allowing to also attach subsets of namespaces supports various
    use-cases where callers setns to a subset of namespaces to retain
    privilege, perform an action and then re-attach another subset of
    namespaces.

    Apart from significantly reducing the number of syscalls needed to
    attach to all currently supported namespaces (eight "open+setns"
    sequences vs just a single "setns()"), this also allows atomic setns
    to a set of namespaces, i.e. either attaching to all namespaces
    succeeds or we fail without having changed anything.

    This is centered around a new internal struct nsset which holds all
    information necessary for a task to switch to a new set of namespaces
    atomically. Fwiw, with this change a pidfd becomes the only token
    needed to interact with a container. I'm expecting this to be
    picked-up by util-linux for nsenter rather soon.

    Associated with this change is a shiny new test-suite dedicated to
    setns() (for pidfds and nsfds alike)"

    * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests/pidfd: add pidfd setns tests
    nsproxy: attach to namespaces via pidfds
    nsproxy: add struct nsset

    Linus Torvalds
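
The pidfd-based setns() usage described in the pull above can be sketched from userspace as follows. This assumes a kernel with this series (5.8 or later) and glibc's setns() wrapper; pidfd_open() is invoked through syscall() because older C libraries lack a wrapper. The helper name attach_to_namespaces() is hypothetical, and the call normally needs sufficient privileges over the target namespaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>        /* setns(), CLONE_NEW* flags */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* number on most architectures, e.g. x86-64, arm64 */
#endif

/*
 * Hypothetical helper: attach the caller to the namespaces of process
 * `pid` selected by `nstypes`, a mask of CLONE_NEW* flags. Returns 0 on
 * success or a negative errno value. Per the description above, setns()
 * with a pidfd either switches all requested namespaces or none.
 */
static int attach_to_namespaces(pid_t pid, int nstypes)
{
    int pidfd;
    int ret = 0;

    assert(pid > 0);

    pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -errno;

    /* A pidfd requires at least one namespace flag; 0 is rejected. */
    if (setns(pidfd, nstypes) < 0)
        ret = -errno;

    close(pidfd);
    return ret;
}
```

For example, attach_to_namespaces(target, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) replaces three separate open-plus-setns sequences on /proc/<pid>/ns/* files with a single call.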
     

03 Jun, 2020

20 commits

  • Pull erofs updates from Gao Xiang:
    "The most interesting part is the new mount api conversion, which is
    actually an old patch already pending for several cycles. And the
    others are recent trivial cleanups here.

    Summary:

    - Convert to use the new mount apis

    - Some random cleanup patches"

    * tag 'erofs-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
    erofs: suppress false positive last_block warning
    erofs: convert to use the new mount fs_context api
    erofs: code cleanup by removing ifdef macro surrounding

    Linus Torvalds
     
  • Pull JFS update from David Kleikamp:
    "Replace zero-length array in JFS"

    * tag 'jfs-5.8' of git://github.com/kleikamp/linux-shaggy:
    jfs: Replace zero-length array with flexible-array member

    Linus Torvalds
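
The zero-length array to flexible-array member conversion, which appears in several pulls in this log, amounts to changing a trailing `type name[0];` declaration into `type name[];`. The sketch below shows the preferred form with a hypothetical struct record and allocator; it is an illustration, not code from JFS or any other converted subsystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Deprecated form:  unsigned char data[0];  (GNU zero-length array)
 * Preferred form:   unsigned char data[];   (C99 flexible array member)
 */
struct record {
    size_t len;
    unsigned char data[];  /* must be the last member of the struct */
};

/* Allocate a record with room for `len` payload bytes and copy them in. */
static struct record *record_new(const void *payload, size_t len)
{
    struct record *r;

    assert(payload != NULL);

    /* sizeof(*r) does not include the flexible array, so add the payload. */
    r = malloc(sizeof(*r) + len);
    if (r == NULL)
        return NULL;

    r->len = len;
    memcpy(r->data, payload, len);
    return r;
}
```

With the C99 form the compiler can diagnose mistakes such as placing the array before other members, which the zero-length extension silently accepts. Kernel code would typically compute such sizes with the struct_size() helper rather than open-coded addition.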
     
  • Pull btrfs updates from David Sterba:
    "Highlights:

    - speedup dead root detection during orphan cleanup, eg. when there
    are many deleted subvolumes waiting to be cleaned, the trees are
    now looked up in radix tree instead of an O(N^2) search

    - snapshot creation with inherited qgroup will mark the qgroup
    inconsistent, requires a rescan

    - send will emit file capabilities after chown, this produces a
    stream that does not need postprocessing to set the capabilities
    again

    - direct io ported to iomap infrastructure, cleaned up and simplified
    code, notably removing last use of struct buffer_head in btrfs code

    Core changes:

    - factor out backreference iteration, to be used by ordinary
    backreferences and relocation code

    - improved global block reserve utilization
    * better logic to serialize requests
    * increased maximum available for unlink
    * improved handling on large pages (64K)

    - direct io cleanups and fixes
    * simplify layering, where cloned bios were unnecessarily created
    for some cases
    * error handling fixes (submit, endio)
    * remove repair worker thread, used to avoid deadlocks during
    repair

    - refactored block group reading code, preparatory work for new type
    of block group storage that should improve mount time on large
    filesystems

    Cleanups:

    - cleaned up (and slightly sped up) set/get helpers for metadata data
    structure members

    - root bit REF_COWS got renamed to SHAREABLE to reflect that the
    blocks of the tree get shared either among subvolumes or with the
    relocation trees

    Fixes:

    - when subvolume deletion fails due to ENOSPC, the filesystem is not
    turned read-only

    - device scan deals with devices from other filesystems that changed
    ownership due to overwrite (mkfs)

    - fix a race between scrub and block group removal/allocation

    - fix long standing bug of a runaway balance operation, printing the
    same line to the syslog, caused by a stale status bit on a reloc
    tree that prevented progress

    - fix corrupt log due to concurrent fsync of inodes with shared
    extents

    - fix space underflow for NODATACOW and buffered writes when it for
    some reason needs to fallback to COW mode"

    * tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
    btrfs: fix space_info bytes_may_use underflow during space cache writeout
    btrfs: fix space_info bytes_may_use underflow after nocow buffered write
    btrfs: fix wrong file range cleanup after an error filling dealloc range
    btrfs: remove redundant local variable in read_block_for_search
    btrfs: open code key_search
    btrfs: split btrfs_direct_IO to read and write part
    btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
    fs: remove dio_end_io()
    btrfs: switch to iomap_dio_rw() for dio
    iomap: remove lockdep_assert_held()
    iomap: add a filesystem hook for direct I/O bio submission
    fs: export generic_file_buffered_read()
    btrfs: turn space cache writeout failure messages into debug messages
    btrfs: include error on messages about failure to write space/inode caches
    btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
    btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
    btrfs: make checksum item extension more efficient
    btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
    btrfs: unexport btrfs_compress_set_level()
    btrfs: simplify iget helpers
    ...

    Linus Torvalds
     
  • Pull DAX updates part two from Darrick Wong:
    "This time around, we're hoisting the DONTCACHE flag from XFS into the
    VFS so that we can make the incore DAX mode changes become effective
    sooner.

    We can't change the file data access mode on a live inode because we
    don't have a safe way to change the file ops pointers. The incore
    state change becomes effective at inode loading time, which can happen
    if the inode is evicted. Therefore, we're making it so that
    filesystems can ask the VFS to evict the inode as soon as the last
    holder drops.

    The per-fs changes to make this call will be in subsequent pull
    requests from Ted and myself.

    Summary:

    - Introduce DONTCACHE flags for dentries and inodes. This hint will
    cause the VFS to drop the associated objects immediately after the
    last put, so that we can change the file access mode (DAX or page
    cache) on the fly"

    * tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    fs: Introduce DCACHE_DONTCACHE
    fs: Lift XFS_IDONTCACHE to the VFS layer

    Linus Torvalds
     
  • Pull DAX updates part one from Darrick Wong:
    "After many years of LKML-wrangling about how to enable programs to
    query and influence the file data access mode (DAX) when a filesystem
    resides on storage devices such as persistent memory, Ira Weiny has
    emerged with a proposed set of standard behaviors that has not been
    shot down by anyone! We're more or less standardizing on the current
    XFS behavior and adapting ext4 to do the same.

    This is the first of a handful of pull requests that will make ext4 and
    XFS present a consistent interface for user programs that care about
    DAX. We add a statx attribute that programs can check to see if DAX is
    enabled on a particular file. Then, we update the DAX documentation to
    spell out the user-visible behaviors that filesystems will guarantee
    (until the next storage industry shakeup). The on-disk inode flag has
    been in XFS for a few years now.

    Summary:

    - Clean up io_is_direct.

    - Add a new statx flag to indicate when file data access is being
    done via DAX (as opposed to the page cache).

    - Update the documentation for how system administrators and
    application programmers can take advantage of the (still
    experimental DAX) feature"

    Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/

    * tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    Documentation/dax: Update Usage section
    fs/stat: Define DAX statx attribute
    fs: Remove unneeded IS_DAX() check in io_is_direct()

    Linus Torvalds
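
The statx attribute added by this pull lets a program ask whether a given file is currently accessed via DAX. A minimal sketch follows, assuming glibc 2.28 or later for the statx() wrapper; the fallback definition of STATX_ATTR_DAX uses the value from current kernel uapi headers and is only there for older userspace headers. The helper name file_uses_dax() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stddef.h>
#include <sys/stat.h>   /* statx(), struct statx */

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000  /* assumed: value in current uapi headers */
#endif

/*
 * Hypothetical helper: return 1 if `path` is currently in DAX mode, 0 if
 * it is not (or the filesystem does not report the attribute), and -1 if
 * statx() fails.
 */
static int file_uses_dax(const char *path)
{
    struct statx stx;

    assert(path != NULL);

    if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
        return -1;

    /* Only trust the bit if the filesystem advertises support for it. */
    if (!(stx.stx_attributes_mask & STATX_ATTR_DAX))
        return 0;

    return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}
```

Checking stx_attributes_mask first distinguishes "not DAX" from "this filesystem cannot say", although this sketch folds both cases into a return value of 0.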
     
  • Pull xfs updates from Darrick Wong:
    "Most of the changes this cycle are refactoring of existing code in
    preparation for things landing in the future.

    We also fixed various problems and deficiencies in the quota
    implementation, and (I hope) the last of the stale read vectors by
    forcing write allocations to go through the unwritten state until the
    write completes.

    Summary:

    - Various cleanups to remove dead code, unnecessary conditionals,
    asserts, etc.

    - Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
    redundantly.

    - Tighten up our dmesg logging to ensure that everything is prefixed
    with 'XFS' for easier grepping.

    - Kill a bunch of typedefs.

    - Refactor the deferred ops code to reduce indirect function calls.

    - Increase type-safety with the deferred ops code.

    - Make the DAX mount options a tri-state.

    - Fix some error handling problems in the inode flush code and clean
    up other inode flush warts.

    - Refactor log recovery so that each log item's recovery functions now
    live with the other log item processing code.

    - Fix some SPDX forms.

    - Fix quota counter corruption if the fs crashes after running
    quotacheck but before any dquots get logged.

    - Don't fail metadata verification on zero-entry attr leaf blocks,
    since they're just part of the disk format now due to a historic
    lack of log atomicity.

    - Don't allow SWAPEXT between files with different [ugp]id when
    quotas are enabled.

    - Refactor inode fork reading and verification to run directly from
    the inode-from-disk function. This means that we now actually
    guarantee that _iget'ted inodes are totally verified and ready to
    go.

    - Move the incore inode fork format and extent counts to the ifork
    structure.

    - Scalability improvements by reducing cacheline pingponging in
    struct xfs_mount.

    - More scalability improvements by removing m_active_trans from the
    hot path.

    - Fix inode counter update sanity checking to run /only/ on debug
    kernels.

    - Fix longstanding inconsistency in what error code we return when a
    program hits project quota limits (ENOSPC).

    - Fix group quota returning the wrong error code when a program hits
    group quota limits.

    - Fix per-type quota limits and grace periods for group and project
    quotas so that they actually work.

    - Allow extension of individual grace periods.

    - Refactor the non-reclaim inode radix tree walking code to remove a
    bunch of stupid little functions and straighten out the
    inconsistent naming schemes.

    - Fix a bug in speculative preallocation where we measured a new
    allocation based on the last extent mapping in the file instead of
    looking farther for the last contiguous space allocation.

    - Force delalloc writes to unwritten extents. This closes a stale
    disk contents exposure vector if the system goes down before the
    write completes.

    - More lockdep whackamole"

    * tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (129 commits)
    xfs: more lockdep whackamole with kmem_alloc*
    xfs: force writes to delalloc regions to unwritten
    xfs: refactor xfs_iomap_prealloc_size
    xfs: measure all contiguous previous extents for prealloc size
    xfs: don't fail unwritten extent conversion on writeback due to edquot
    xfs: rearrange xfs_inode_walk_ag parameters
    xfs: straighten out all the naming around incore inode tree walks
    xfs: move xfs_inode_ag_iterator to be closer to the perag walking code
    xfs: use bool for done in xfs_inode_ag_walk
    xfs: fix inode ag walk predicate function return values
    xfs: refactor eofb matching into a single helper
    xfs: remove __xfs_icache_free_eofblocks
    xfs: remove flags argument from xfs_inode_ag_walk
    xfs: remove xfs_inode_ag_iterator_flags
    xfs: remove unused xfs_inode_ag_iterator function
    xfs: replace open-coded XFS_ICI_NO_TAG
    xfs: move eofblocks conversion function to xfs_ioctl.c
    xfs: allow individual quota grace period extension
    xfs: per-type quota timers and warn limits
    xfs: switch xfs_get_defquota to take explicit type
    ...

    Linus Torvalds
     
  • Pull io_uring updates from Jens Axboe:
    "A relatively quiet round, mostly just fixes and code improvements. In
    particular:

    - Make statx just use the generic statx handler, instead of open
    coding it. We don't need that anymore, as we always call it async
    safe (Bijan)

    - Enable closing of the ring itself. Also fixes O_PATH closure (me)

    - Properly name completion members (me)

    - Batch reap of dead file registrations (me)

    - Allow IORING_OP_POLL with double waitqueues (me)

    - Add tee(2) support (Pavel)

    - Remove double off read (Pavel)

    - Fix overflow cancellations (Pavel)

    - Improve CQ timeouts (Pavel)

    - Async defer drain fixes (Pavel)

    - Add support for enabling/disabling notifications on a registered
    eventfd (Stefano)

    - Remove dead state parameter (Xiaoguang)

    - Disable SQPOLL submit on dying ctx (Xiaoguang)

    - Various code cleanups"

    * tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits)
    io_uring: fix overflowed reqs cancellation
    io_uring: off timeouts based only on completions
    io_uring: move timeouts flushing to a helper
    statx: hide interfaces no longer used by io_uring
    io_uring: call statx directly
    statx: allow system call to be invoked from io_uring
    io_uring: add io_statx structure
    io_uring: get rid of manual punting in io_close
    io_uring: separate DRAIN flushing into a cold path
    io_uring: don't re-read sqe->off in timeout_prep()
    io_uring: simplify io_timeout locking
    io_uring: fix flush req->refs underflow
    io_uring: don't submit sqes when ctx->refs is dying
    io_uring: async task poll trigger cleanup
    io_uring: add tee(2) support
    splice: export do_tee()
    io_uring: don't repeat valid flag list
    io_uring: rename io_file_put()
    io_uring: remove req->needs_fixed_files
    io_uring: cleanup io_poll_remove_one() logic
    ...

    Linus Torvalds
     
  • Pull block driver updates from Jens Axboe:
    "On top of the core changes, here are the block driver changes for this
    merge window:

    - NVMe changes:
    - NVMe over Fibre Channel protocol updates, which also reach
    over to drivers/scsi/lpfc (James Smart)
    - namespace revalidation support on the target (Anthony
    Iliopoulos)
    - gcc zero length array fix (Arnd Bergmann)
    - nvmet cleanups (Chaitanya Kulkarni)
    - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
    - use a SRQ per completion vector (Max Gurtovoy)
    - fix handling of runtime changes to the queue count (Weiping
    Zhang)
    - t10 protection information support for nvme-rdma and
    nvmet-rdma (Israel Rukshin and Max Gurtovoy)
    - target side AEN improvements (Chaitanya Kulkarni)
    - various fixes and minor improvements all over, including the
    nvme part of the lpfc driver

    - Floppy code cleanup series (Willy, Denis)

    - Floppy contention fix (Jiri)

    - Loop CONFIGURE support (Martijn)

    - bcache fixes/improvements (Coly, Joe, Colin)

    - q->queuedata cleanups (Christoph)

    - Get rid of ioctl_by_bdev (Christoph, Stefan)

    - md/raid5 allocation fixes (Coly)

    - zero length array fixes (Gustavo)

    - swim3 task state fix (Xu)"

    * tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
    bcache: configure the asynchronous registertion to be experimental
    bcache: asynchronous devices registration
    bcache: fix refcount underflow in bcache_device_free()
    bcache: Convert pr_ uses to a more typical style
    bcache: remove redundant variables i and n
    lpfc: Fix return value in __lpfc_nvme_ls_abort
    lpfc: fix axchg pointer reference after free and double frees
    lpfc: Fix pointer checks and comments in LS receive refactoring
    nvme: set dma alignment to qword
    nvmet: cleanups the loop in nvmet_async_events_process
    nvmet: fix memory leak when removing namespaces and controllers concurrently
    nvmet-rdma: add metadata/T10-PI support
    nvmet: add metadata support for block devices
    nvmet: add metadata/T10-PI support
    nvme: add Metadata Capabilities enumerations
    nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
    nvmet: rename nvmet_rw_len to nvmet_rw_data_len
    nvmet: add metadata characteristics for a namespace
    nvme-rdma: add metadata/T10-PI support
    nvme-rdma: introduce nvme_rdma_sgl structure
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "Core block changes that have been queued up for this release:

    - Remove dead blk-throttle and blk-wbt code (Guoqing)

    - Include pid in blktrace note traces (Jan)

    - Don't spew I/O errors on wouldblock termination (me)

    - Zone append addition (Johannes, Keith, Damien)

    - IO accounting improvements (Konstantin, Christoph)

    - blk-mq hardware map update improvements (Ming)

    - Scheduler dispatch improvement (Salman)

    - Inline block encryption support (Satya)

    - Request map fixes and improvements (Weiping)

    - blk-iocost tweaks (Tejun)

    - Fix for timeout failing with error injection (Keith)

    - Queue re-run fixes (Douglas)

    - CPU hotplug improvements (Christoph)

    - Queue entry/exit improvements (Christoph)

    - Move DMA drain handling to the few drivers that use it (Christoph)

    - Partition handling cleanups (Christoph)"

    * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
    block: mark bio_wouldblock_error() bio with BIO_QUIET
    blk-wbt: rename __wbt_update_limits to wbt_update_limits
    blk-wbt: remove wbt_update_limits
    blk-throttle: remove tg_drain_bios
    blk-throttle: remove blk_throtl_drain
    null_blk: force complete for timeout request
    blk-mq: drain I/O when all CPUs in a hctx are offline
    blk-mq: add blk_mq_all_tag_iter
    blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
    blk-mq: use BLK_MQ_NO_TAG in more places
    blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
    blk-mq: move more request initialization to blk_mq_rq_ctx_init
    blk-mq: simplify the blk_mq_get_request calling convention
    blk-mq: remove the bio argument to ->prepare_request
    nvme: force complete cancelled requests
    blk-mq: blk-mq: provide forced completion method
    block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
    block: blk-crypto-fallback: remove redundant initialization of variable err
    block: reduce part_stat_lock() scope
    block: use __this_cpu_add() instead of access by smp_processor_id()
    ...

    Linus Torvalds
     
  • Pull power management updates from Rafael Wysocki:
    "These rework the system-wide PM driver flags, make runtime switching
    of cpuidle governors easier, improve the user space hibernation
    interface code, add intel-speed-select interface documentation, add
    more debug messages to the ACPI code handling suspend to idle, update
    the cpufreq core and drivers, fix a minor issue in the cpuidle core
    and update two cpuidle drivers, improve the PM-runtime framework,
    update the Intel RAPL power capping driver, update devfreq core and
    drivers, and clean up the cpupower utility.

    Specifics:

    - Rework the system-wide PM driver flags to make them easier to
    understand and use and update their documentation (Rafael Wysocki,
    Alan Stern).

    - Allow cpuidle governors to be switched at run time regardless of
    the kernel configuration and update the related documentation
    accordingly (Hanjun Guo).

    - Improve the resume device handling in the user space hibernation
    interface code (Domenico Andreoli).

    - Document the intel-speed-select sysfs interface (Srinivas
    Pandruvada).

    - Make the ACPI code handling suspend to idle print more debug
    messages to help diagnose issues with it (Rafael Wysocki).

    - Fix a helper routine in the cpufreq core and correct a typo in the
    struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
    Wenhu).

    - Update cpufreq drivers:

    - Make the intel_pstate driver start in the passive mode by
    default on systems without HWP (Rafael Wysocki).

    - Add i.MX7ULP support to the imx-cpufreq-dt driver and add
    i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).

    - Convert the qoriq cpufreq driver to a platform one, make the
    platform code create a suitable device object for it and add
    platform dependencies to it (Mian Yousaf Kaukab, Geert
    Uytterhoeven).

    - Fix wrong compatible binding in the qcom driver (Ansuel Smith).

    - Build the omap driver by default for ARCH_OMAP2PLUS (Anders
    Roxell).

    - Add r8a7742 SoC support to the dt cpufreq driver (Lad
    Prabhakar).

    - Update cpuidle core and drivers:

    - Fix three reference count leaks in error code paths in the
    cpuidle core (Qiushi Wu).

    - Convert Qualcomm SPM to a generic cpuidle driver (Stephan
    Gerhold).

    - Fix up the execution order when entering a domain idle state in
    the PSCI driver (Ulf Hansson).

    - Fix a reference counting issue related to clock management and
    clean up two oddities in the PM-runtime framework (Rafael Wysocki,
    Andy Shevchenko).

    - Add ElkhartLake support to the Intel RAPL power capping driver and
    remove an unused local MSR definition from it (Jacob Pan, Sumeet
    Pawnikar).

    - Update devfreq core and drivers:

    - Replace strncpy() with strscpy() in the devfreq core and use
    lockdep asserts instead of manual checks for a locked mutex in
    it (Dmitry Osipenko, Krzysztof Kozlowski).

    - Add a generic imx bus scaling driver and make it register an
    interconnect device (Leonard Crestez, Gustavo A. R. Silva).

    - Make the cpufreq notifier in the tegra30 driver take boosting
    into account and delete an unuseful error message from that
    driver (Dmitry Osipenko, Markus Elfring).

    - Remove unneeded semicolon from the cpupower code (Zou Wei)"

    * tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (51 commits)
    cpuidle: Fix three reference count leaks
    PM: runtime: Replace pm_runtime_callbacks_present()
    PM / devfreq: Use lockdep asserts instead of manual checks for locked mutex
    PM / devfreq: imx-bus: Fix inconsistent IS_ERR and PTR_ERR
    PM / devfreq: Replace strncpy with strscpy
    PM / devfreq: imx: Register interconnect device
    PM / devfreq: Add generic imx bus scaling driver
    PM / devfreq: tegra30: Delete an error message in tegra_devfreq_probe()
    PM / devfreq: tegra30: Make CPUFreq notifier to take into account boosting
    PM: hibernate: Restrict writes to the resume device
    PM: runtime: clk: Fix clk_pm_runtime_get() error path
    cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver
    ACPI: EC: PM: s2idle: Extend GPE dispatching debug message
    ACPI: PM: s2idle: Print type of wakeup debug messages
    powercap: RAPL: remove unused local MSR define
    PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()
    Documentation: admin-guide: pm: Document intel-speed-select
    PM: hibernate: Split off snapshot dev option
    PM: hibernate: Incorporate concurrency handling
    Documentation: ABI: make current_governer_ro as a candidate for removal
    ...

    Linus Torvalds
     
  • Merge updates from Andrew Morton:
    "A few little subsystems and a start of a lot of MM patches.

    Subsystems affected by this patch series: squashfs, ocfs2, parisc,
    vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
    swap, memcg, pagemap, memory-failure, vmalloc, kasan"

    * emailed patches from Andrew Morton : (128 commits)
    kasan: move kasan_report() into report.c
    mm/mm_init.c: report kasan-tag information stored in page->flags
    ubsan: entirely disable alignment checks under UBSAN_TRAP
    kasan: fix clang compilation warning due to stack protector
    x86/mm: remove vmalloc faulting
    mm: remove vmalloc_sync_(un)mappings()
    x86/mm/32: implement arch_sync_kernel_mappings()
    x86/mm/64: implement arch_sync_kernel_mappings()
    mm/ioremap: track which page-table levels were modified
    mm/vmalloc: track which page-table levels were modified
    mm: add functions to track page directory modifications
    s390: use __vmalloc_node in stack_alloc
    powerpc: use __vmalloc_node in alloc_vm_stack
    arm64: use __vmalloc_node in arch_alloc_vmap_stack
    mm: remove vmalloc_user_node_flags
    mm: switch the test_vmalloc module to use __vmalloc_node
    mm: remove __vmalloc_node_flags_caller
    mm: remove both instances of __vmalloc_node_flags
    mm: remove the prot argument to __vmalloc_node
    mm: remove the pgprot argument to __vmalloc
    ...

    Linus Torvalds
     
  • The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Michael Kelley [hyperv]
    Acked-by: Gao Xiang [erofs]
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Wei Liu
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • This is always PAGE_KERNEL - for long term mappings with other properties
    vmap should be used.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-19-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Currently, when reading /proc/PID/smaps, PMD migration entries in the
    page table are simply ignored. To improve the accuracy of
    /proc/PID/smaps, add parsing and processing of these entries.

    To test the patch, we run pmbench to eat 400 MB memory in background,
    then run /usr/bin/migratepages and `cat /proc/PID/smaps` every second.
    The following issue can be reproduced within 60 seconds.

    Before the patch, for the fully populated 400 MB anonymous VMA, some THP
    pages under migration may be lost as below.

    7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
    Size: 409600 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Rss: 407552 kB
    Pss: 407552 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 407552 kB
    Referenced: 301056 kB
    Anonymous: 407552 kB
    LazyFree: 0 kB
    AnonHugePages: 405504 kB
    ShmemPmdMapped: 0 kB
    FilePmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    Locked: 0 kB
    THPeligible: 1
    VmFlags: rd wr mr mw me ac

    After the patch, the output is always:

    7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
    Size: 409600 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Rss: 409600 kB
    Pss: 409600 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 409600 kB
    Referenced: 294912 kB
    Anonymous: 409600 kB
    LazyFree: 0 kB
    AnonHugePages: 407552 kB
    ShmemPmdMapped: 0 kB
    FilePmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    Locked: 0 kB
    THPeligible: 1
    VmFlags: rd wr mr mw me ac
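
    As an illustration only (not part of the patch), the gap between the two
    dumps above can be checked with a small user-space script. The helper
    names parse_smaps_entry() and missing_rss_kb() are hypothetical, and the
    sample text holds just a subset of the fields quoted above.

```python
import re

# Subset of the fields from the "before" and "after" dumps quoted above.
BEFORE = """\
7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
Size: 409600 kB
Rss: 407552 kB
AnonHugePages: 405504 kB
THPeligible: 1
VmFlags: rd wr mr mw me ac
"""

AFTER = """\
7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
Size: 409600 kB
Rss: 409600 kB
AnonHugePages: 407552 kB
THPeligible: 1
VmFlags: rd wr mr mw me ac
"""


def parse_smaps_entry(text):
    """Collect the 'Key: N kB' fields of one smaps entry as {key: kB}."""
    fields = {}
    for line in text.splitlines():
        # Only lines of the form "Name:   123 kB" are kept; the address
        # range header, THPeligible and VmFlags lines do not match.
        m = re.match(r"\s*(\w+):\s+(\d+)\s+kB\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def missing_rss_kb(fields):
    """Return how much of the mapping is not accounted for in Rss (kB)."""
    return fields["Size"] - fields["Rss"]


if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        print(label, missing_rss_kb(parse_smaps_entry(text)), "kB missing")
```

    For the dumps above, the "before" entry is short by 2048 kB, which would
    correspond to a single huge page if the THP size is 2 MB (an assumption,
    since the commit message does not state it); the "after" entry is short
    by nothing.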

    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Alexey Dobriyan
    Cc: Konstantin Khlebnikov
    Cc: "Jérôme Glisse"
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/20200403123059.1846960-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • After an NFS page has been written it is considered "unstable" until a
    COMMIT request succeeds. If the COMMIT fails, the page will be
    re-written.

    These "unstable" pages are currently accounted as "reclaimable", either
    in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
    'reclaimable' count. This might have made sense when sending the COMMIT
    required a separate action by the VFS/MM (e.g. releasepage() used to
    send a COMMIT). However now that all writes generated by ->writepages()
    will automatically be followed by a COMMIT (since commit 919e3bd9a875
    ("NFS: Ensure we commit after writeback is complete")) it makes more
    sense to treat them as writeback pages.

    So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
    NR_WRITEBACK and WB_WRITEBACK.

    A particular effect of this change is that when
    wb_check_background_flush() calls wb_over_bg_threshold(), the latter
    will report 'true' a lot less often as the 'unstable' pages are no
    longer considered 'dirty' (as there is nothing that writeback can do
    about them anyway).

    Currently wb_check_background_flush() will trigger writeback to NFS even
    when there are relatively few dirty pages (if there are lots of unstable
    pages). This can result in small writes going to the server (10s of
    Kilobytes rather than a Megabyte), which hurts throughput. With this
    patch, there are fewer writes which are each larger on average.
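
    The effect on the background-flush decision can be sketched with a toy
    model. This is a simplification for illustration, not the kernel code:
    the function over_bg_threshold() and its parameters are hypothetical,
    and the real wb_over_bg_threshold() considers more than one counter and
    threshold.

```python
def over_bg_threshold(dirty, unstable, bg_thresh, unstable_is_reclaimable):
    """Toy model: should background writeback be kicked?

    dirty, unstable and bg_thresh are page counts. When
    unstable_is_reclaimable is True, unstable NFS pages are added to the
    'reclaimable' total (the old accounting); when False they are treated
    as writeback pages and left out (the new accounting).
    """
    reclaimable = dirty + (unstable if unstable_is_reclaimable else 0)
    return reclaimable > bg_thresh


if __name__ == "__main__":
    # Few dirty pages, many pages still waiting for COMMIT.
    print("old accounting:", over_bg_threshold(20, 500, 256, True))
    print("new accounting:", over_bg_threshold(20, 500, 256, False))
```

    With the made-up numbers above, the old accounting asks for background
    writeback although only 20 pages are actually dirty, while the new
    accounting does not.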

    Where the NR_UNSTABLE_NFS count was included in statistics
    virtual-files, the entry is retained, but the value is hard-coded as
    zero. Static trace points and warning printks which mentioned this
    counter no longer report it.

    [akpm@linux-foundation.org: re-layout comment]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: Trond Myklebust
    Acked-by: Michal Hocko [mm]
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
    loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
    daemon needs to write to one bdi (the final bdi) in order to free up
    writes queued to another bdi (the client bdi).

    The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
    pages, so that it can still dirty pages after other processes have been
    throttled. The purpose of this is to avoid the deadlock that happens when
    the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
    but it is being throttled and cannot write.

    This approach was designed when all threads were blocked equally,
    independently of which device they were writing to, or how fast it was.
    Since that time the writeback algorithm has changed substantially with
    different threads getting different allowances based on non-trivial
    heuristics. This means the simple "add 25%" heuristic is no longer
    reliable.

    The important issue is not that the daemon needs a *larger* dirty page
    allowance, but that it needs a *private* dirty page allowance, so that
    dirty pages for the "client" bdi that it is helping to clear (the bdi
    for an NFS filesystem or loop block device etc) do not affect the
    throttling of the daemon writing to the "final" bdi.

    This patch changes the heuristic so that the task is not throttled when
    the bdi it is writing to has a dirty page count below (or equal
    to) the free-run threshold for that bdi. This ensures it will always be
    able to have some pages in flight, and so will not deadlock.

    In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
    still be throttled by global threshold, but that is acceptable as it is
    only the deadlock state that is interesting for this flag.

    This approach of "only throttle when target bdi is busy" is consistent
    with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
    it causes attention to be focussed only on the target bdi.

    So this patch
    - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
    - removes the 25% bonus that that flag gives, and
    - does not delay at all when PF_LOCAL_THROTTLE is set, unless both
    the global and the local free-run thresholds are exceeded.
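
    The resulting decision can be sketched as a small model. It is a
    simplification for illustration rather than the kernel implementation:
    should_throttle() and its parameter names are hypothetical, and the real
    code also computes how long to pause, which is omitted here.

```python
def should_throttle(local_throttle, global_dirty, global_freerun,
                    bdi_dirty, bdi_freerun):
    """Toy model of the throttling decision described above.

    local_throttle stands for PF_LOCAL_THROTTLE being set on the task.
    The remaining arguments are page counts: the global dirty count and
    free-run threshold, and the same pair for the bdi being written to.
    """
    # Below (or at) the global free-run threshold nobody is delayed.
    if global_dirty <= global_freerun:
        return False
    # A PF_LOCAL_THROTTLE task is also let through while the bdi it is
    # writing to is below (or equal to) that bdi's free-run threshold.
    if local_throttle and bdi_dirty <= bdi_freerun:
        return False
    return True


if __name__ == "__main__":
    # Globally over the threshold, but the target bdi is nearly idle.
    print("flagged task:", should_throttle(True, 1000, 500, 10, 50))
    print("ordinary task:", should_throttle(False, 1000, 500, 10, 50))
```

    In this model a flagged task always keeps some pages in flight to an
    idle target bdi, however many dirty pages belong to other bdis.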

    Note that previously realtime threads were treated the same as
    PF_LESS_THROTTLE threads. This patch does *not* change the behaviour
    for real-time threads, so it is now different from the behaviour of nfsd
    and loop tasks. I don't know what is wanted for realtime.

    [akpm@linux-foundation.org: coding style fixes]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Acked-by: Chuck Lever [nfsd]
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Trond Myklebust
    Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Now that the new pair of functions has been introduced, we can call
    them to clean up the code in orangefs.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Tested-by: Mike Marshall
    Reviewed-by: Andrew Morton
    Cc: Martin Brandenburg
    Link: http://lkml.kernel.org/r/20200517214718.468-9-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang
     
  • Call the new function since attach_page_buffers will be removed.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Anton Altaparmakov
    Link: http://lkml.kernel.org/r/20200517214718.468-8-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang
     
  • Now that the new pair of functions has been introduced, we can call
    them to clean up the code in iomap.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Link: http://lkml.kernel.org/r/20200517214718.468-7-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang
     
  • Now that the new pair of functions has been introduced, we can call
    them to clean up the code in f2fs.h.

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Acked-by: Chao Yu
    Cc: Jaegeuk Kim
    Link: http://lkml.kernel.org/r/20200517214718.468-6-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang