30 Dec, 2020

1 commit

  • [ Upstream commit 88149082bb8ef31b289673669e080ec6a00c2e59 ]

    If generic_drop_inode() returns true, it means iput_final() can evict
    this inode regardless of whether it is dirty or not. If we check
    I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be
    evicted unconditionally. This is not the desired behavior because
    I_DONTCACHE only means the inode shouldn't be cached on the LRU list.
    As for whether we need to evict this inode, this is what
    generic_drop_inode() should do. This patch corrects the usage of
    I_DONTCACHE.
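
    The split of responsibilities described above can be sketched as a small
    userspace model. This is only an illustration: the struct, the enum and
    the model_* names are hypothetical stand-ins, and the real iput_final()
    handles locking and writeback that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's inode state bit. */
#define I_DONTCACHE (1u << 0)

enum final_action {
	KEEP_ON_LRU,       /* inode stays cached on the LRU list */
	WRITE_THEN_EVICT,  /* not droppable, but must not be cached: write back first */
	EVICT_NOW          /* drop decision was true: evict regardless of dirtiness */
};

struct model_inode {
	unsigned int nlink;
	bool hashed;
	unsigned int state;
};

/* Only link count and hash state decide "drop"; I_DONTCACHE is not checked here. */
static bool model_generic_drop_inode(const struct model_inode *inode)
{
	return inode->nlink == 0 || !inode->hashed;
}

/* I_DONTCACHE is consulted only when deciding whether to park the inode on the LRU. */
static enum final_action model_iput_final(const struct model_inode *inode,
					  bool sb_active)
{
	bool drop = model_generic_drop_inode(inode);

	if (!drop && !(inode->state & I_DONTCACHE) && sb_active)
		return KEEP_ON_LRU;
	if (!drop)
		return WRITE_THEN_EVICT;
	return EVICT_NOW;
}
```

    In this model an inode with I_DONTCACHE set and a non-zero link count is
    written back before eviction instead of being dropped unconditionally.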

    This patch was proposed in [1].

    [1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/

    Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
    Signed-off-by: Hao Li
    Reviewed-by: Dave Chinner
    Reviewed-by: Ira Weiny
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Hao Li
     

17 Oct, 2020

1 commit

  • The page cache needs to know whether the filesystem supports THPs so that
    it doesn't send THPs to filesystems which can't handle them. Dave Chinner
    points out that getting from the page mapping to the filesystem type is
    too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
    information in the address space flags.
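
    A rough userspace sketch of the idea follows. The struct layouts and the
    MODEL_* flag names are assumptions made for illustration, not the kernel
    definitions; the point is only that the answer is computed once and then
    read from the mapping itself.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FS_THP_SUPPORT (1u << 0)   /* hypothetical filesystem-type flag */
#define MODEL_AS_THP_SUPPORT (1ul << 0)  /* hypothetical address-space flag */

struct fs_type { unsigned int fs_flags; };
struct super_block { struct fs_type *s_type; };
struct inode { struct super_block *i_sb; };
struct address_space { struct inode *host; unsigned long flags; };

/* Slow path: a chain of dependent pointer loads, mapping->host->i_sb->s_type. */
static bool thp_support_slow(const struct address_space *mapping)
{
	return (mapping->host->i_sb->s_type->fs_flags & MODEL_FS_THP_SUPPORT) != 0;
}

/* Done once when the mapping is set up: copy the answer into its flags. */
static void cache_thp_support(struct address_space *mapping)
{
	if (thp_support_slow(mapping))
		mapping->flags |= MODEL_AS_THP_SUPPORT;
}

/* Fast path: a single load from the mapping. */
static bool thp_support_fast(const struct address_space *mapping)
{
	return (mapping->flags & MODEL_AS_THP_SUPPORT) != 0;
}
```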

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Song Liu
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

06 Jun, 2020

1 commit

  • Pull AFS updates from David Howells:
    "There's some core VFS changes which affect a couple of filesystems:

    - Make the inode hash table RCU safe and provide some RCU-safe
    accessor functions. The search can then be done without taking the
    inode_hash_lock. Care must be taken because the object may be being
    deleted and no wait is made.

    - Allow iunique() to avoid taking the inode_hash_lock.

    - Allow AFS's callback processing to avoid taking the inode_hash_lock
    when using the inode table to find an inode to notify.

    - Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
    I've plugged this issue with try-lock in ext4 lazy time update.
    This solution is much better."

    Then there's a set of changes to make a number of improvements to the
    AFS driver:

    - Improve callback (ie. third party change notification) processing
    by:

    (a) Relying more on the fact we're doing this under RCU and by
    using fewer locks. This makes use of the RCU-based inode
    searching outlined above.

    (b) Moving to keeping volumes in a tree indexed by volume ID
    rather than a flat list.

    (c) Making the server and volume records logically part of the
    cell. This means that a server record now points directly at
    the cell and the tree of volumes is there. This removes an N:M
    mapping table, simplifying things.

    - Improve keeping NAT or firewall channels open for the server
    callbacks to reach the client by actively polling the fileserver on
    a timed basis, instead of only doing it when we have an operation
    to process.

    - Improving detection of delayed or lost callbacks by including the
    parent directory in the list of file IDs to be queried when doing a
    bulk status fetch from lookup. We can then check to see if our copy
    of the directory has changed under us without us getting notified.

    - Determine aliasing of cells (such as a cell that is pointed to by a
    DNS alias). This allows us to avoid having ambiguity due to
    apparently different cells using the same volume and file servers.

    - Improve the fileserver rotation to do more probing when it detects
    that all of the addresses to a server are listed as non-responsive.
    It's possible that an address that previously stopped responding
    has become responsive again.

    Beyond that, lay some foundations for making some calls asynchronous:

    - Turn the fileserver cursor struct into a general operation struct
    and hang the parameters off of that rather than keeping them in
    local variables and hang results off of that rather than the call
    struct.

    - Implement some general operation handling code and simplify the
    callers of operations that affect a volume or a volume component
    (such as a file). Most of the operation is now done by core code.

    - Operations are supplied with a table of operations to issue
    different variants of RPCs and to manage the completion, where all
    the required data is held in the operation object, thereby allowing
    these to be called from a workqueue.

    - Put the standard "if (begin), while(select), call op, end" sequence
    into a canned function that just emulates the current behaviour for
    now.

    There are also some fixes interspersed:

    - Don't let the EACCES from ICMP6 mapping reach the user as such,
    since it's confusing as to whether it's a filesystem error. Convert
    it to EHOSTUNREACH.

    - Don't use the epoch value acquired through probing a server. If we
    have two servers with the same UUID but in different cells, it's
    hard to draw conclusions from them having different epoch values.

    - Don't interpret the argument to the CB.ProbeUuid RPC as a
    fileserver UUID and look up a fileserver from it.

    - Deal with servers in different cells having the same UUIDs. In the
    event that a CB.InitCallBackState3 RPC is received, we have to
    break the callback promises for every server record matching that
    UUID.

    - Don't let afs_statfs return values that go below 0.

    - Don't use running fileserver probe state to make server selection
    and address selection decisions on. Only make decisions on final
    state as the running state is cleared at the start of probing"

    Acked-by: Al Viro (fs/inode.c part)

    * tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
    afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
    afs: Show more a bit more server state in /proc/net/afs/servers
    afs: Don't use probe running state to make decisions outside probe code
    afs: Fix afs_statfs() to not let the values go below zero
    afs: Fix the by-UUID server tree to allow servers with the same UUID
    afs: Reorganise volume and server trees to be rooted on the cell
    afs: Add a tracepoint to track the lifetime of the afs_volume struct
    afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
    afs: Detect cell aliases 2 - Cells with no root volumes
    afs: Detect cell aliases 1 - Cells with root volumes
    afs: Implement client support for the YFSVL.GetCellName RPC op
    afs: Retain more of the VLDB record for alias detection
    afs: Fix handling of CB.ProbeUuid cache manager op
    afs: Don't get epoch from a server because it may be ambiguous
    afs: Build an abstraction around an "operation" concept
    afs: Rename struct afs_fs_cursor to afs_operation
    afs: Remove the error argument from afs_protocol_error()
    afs: Set error flag rather than return error from file status decode
    afs: Make callback processing more efficient.
    afs: Show more information in /proc/net/afs/servers
    ...

    Linus Torvalds
     

04 Jun, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

31 May, 2020

1 commit

  • Make the inode hash table RCU searchable so that searches that want to
    access or modify an inode without taking a ref on that inode can do so
    without taking the inode hash table lock.

    The main thing this requires is some RCU annotation on the list
    manipulation operations. Inodes are already freed by RCU in most cases.

    Users of this interface must take care as the inode may be still under
    construction or may be being torn down around them.

    There are at least three instances where this can be of use:

    (1) Testing whether the inode number iunique() is going to return is
    currently unique (the iunique_lock is still held).

    (2) Ext4 date stamp updating.

    (3) AFS callback breaking.

    Signed-off-by: David Howells
    Acked-by: Konstantin Khlebnikov
    cc: linux-ext4@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common handlers,
    a lot of the changes are mechanical.
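
    As a minimal userspace sketch of that pattern (the function names are
    hypothetical and memcpy() merely stands in for copy_from_user()), the
    common code makes a terminated copy and the handler only ever sees that
    copy:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*model_handler)(const char *kbuf, size_t len);

/* Common code: copy the caller's bytes and always append a NUL terminator. */
static int model_common_write(const char *ubuf, size_t count, model_handler handler)
{
	char *kbuf = malloc(count + 1);
	int ret;

	if (!kbuf)
		return -1;
	memcpy(kbuf, ubuf, count);   /* stands in for copy_from_user() */
	kbuf[count] = '\0';
	ret = handler(kbuf, count);
	free(kbuf);
	return ret;
}

/* Hypothetical handler: string functions are safe because of the terminator. */
static int demo_handler(const char *kbuf, size_t len)
{
	return strlen(kbuf) == len ? atoi(kbuf) : -1;
}
```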

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

21 Apr, 2020

1 commit

  • Using *foo makes the toolchain think that this is emphasis, causing
    these warnings:

    ./fs/inode.c:1609: WARNING: Inline emphasis start-string without end-string.
    ./fs/inode.c:1609: WARNING: Inline emphasis start-string without end-string.
    ./fs/inode.c:1615: WARNING: Inline emphasis start-string without end-string.

    So use ``*foo`` instead, in order to mark it as an inline literal.

    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/e8da46a0e57f2af6d63a0c53665495075698e28a.1586881715.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

06 Mar, 2020

1 commit

  • As reported by Jann, ihold() does not in fact guarantee inode
    persistence. And instead of making it so, replace the usage of inode
    pointers with a per boot, machine wide, unique inode identifier.

    This sequence number is global, but shared (file backed) futexes are
    rare enough that this should not become a performance issue.
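
    One way to picture such an identifier is the following userspace model
    built on C11 atomics. The names are hypothetical and this is a sketch of
    the idea, not the kernel code: each inode lazily takes the next value of
    a single global counter, and 0 is reserved to mean "not assigned yet".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct model_inode {
	_Atomic uint64_t i_sequence;   /* 0 means "not assigned yet" */
};

/* One counter for the whole machine, restarted on every boot. */
static _Atomic uint64_t model_global_seq;

static uint64_t model_inode_sequence_number(struct model_inode *inode)
{
	uint64_t old = atomic_load(&inode->i_sequence);
	uint64_t fresh;

	if (old)
		return old;   /* already assigned */

	do {
		fresh = atomic_fetch_add(&model_global_seq, 1) + 1;
	} while (fresh == 0);   /* skip the reserved value on wraparound */

	old = 0;
	if (atomic_compare_exchange_strong(&inode->i_sequence, &old, fresh))
		return fresh;
	return old;   /* another thread assigned a number first; use that one */
}
```

    Because the number is stored in the inode model rather than derived from
    its address, a recycled pointer cannot alias an old key.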

    Reported-by: Jann Horn
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)

    Peter Zijlstra
     

09 Feb, 2020

1 commit

  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

05 Feb, 2020

1 commit

  • Pull vfs timestamp updates from Al Viro:
    "More 64bit timestamp work"

    * 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    kernfs: don't bother with timestamp truncation
    fs: Do not overload update_time
    fs: Delete timespec64_trunc()
    fs: ubifs: Eliminate timespec64_trunc() usage
    fs: ceph: Delete timespec64_trunc() usage
    fs: cifs: Delete usage of timespec64_trunc
    fs: fat: Eliminate timespec64_trunc() usage
    utimes: Clamp the timestamps in notify_change()

    Linus Torvalds
     

03 Feb, 2020

1 commit

  • Currently, bmap() returns either the physical block number for the
    requested file offset, or 0 if an error occurs or the requested offset
    maps into a hole.
    This patch makes the changes needed for bmap() to properly return
    errors: the return value becomes an error code, and a pointer must now
    be passed to bmap() to be filled with the mapped physical block.

    It will change the behavior of bmap() on return:

    - negative value in case of error
    - zero on success or map fell into a hole

    In case of a hole, the *block will be zero too

    Since this is a prep patch, for now the only error return is -EINVAL if
    ->bmap doesn't exist.
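
    The new calling convention can be sketched in userspace as below. The
    struct and the demo filesystem are hypothetical, and treating *block as
    both input (file block) and output (physical block) is an assumption of
    this sketch rather than something the message spells out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct model_inode {
	/* Optional per-filesystem mapping hook; NULL when unsupported. */
	sector_t (*bmap)(const struct model_inode *inode, sector_t file_block);
};

/*
 * Returns a negative errno on error and 0 on success.
 * On success *block holds the physical block, or 0 for a hole.
 */
static int model_bmap(const struct model_inode *inode, sector_t *block)
{
	if (!inode->bmap)
		return -EINVAL;
	*block = inode->bmap(inode, *block);
	return 0;
}

/* Hypothetical filesystem: file block 0 is a hole, others map to block + 100. */
static sector_t demo_bmap(const struct model_inode *inode, sector_t file_block)
{
	(void)inode;
	return file_block == 0 ? 0 : file_block + 100;
}
```

    Callers can now tell "no mapping hook" (-EINVAL) apart from "hole"
    (return 0 with *block == 0), which the old single return value could not
    express.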

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
     

23 Jan, 2020

1 commit

  • Casefolded encrypted directories will use a new dirhash method that
    requires a secret key. If the directory uses a v2 encryption policy,
    it's easy to derive this key from the master key using HKDF. However,
    v1 encryption policies don't provide a way to derive additional keys.

    Therefore, don't allow casefolding on directories that use a v1 policy.
    Specifically, make it so that trying to enable casefolding on a
    directory that has a v1 policy fails, trying to set a v1 policy on a
    casefolded directory fails, and trying to open a casefolded directory
    that has a v1 policy (if one somehow exists on-disk) fails.

    Signed-off-by: Daniel Rosenberg
    [EB: improved commit message, updated fscrypt.rst, and other cleanups]
    Link: https://lore.kernel.org/r/20200120223201.241390-2-ebiggers@kernel.org
    Signed-off-by: Eric Biggers

    Daniel Rosenberg
     

18 Dec, 2019

1 commit

  • Anything that walks all inodes on sb->s_inodes list without rescheduling
    risks softlockups.

    Previous efforts were made in 2 functions, see:

    c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
    ac05fbb inode: don't softlockup when evicting inodes

    but there hasn't been an audit of all walkers, so do that now. This
    also consistently moves the cond_resched() calls to the bottom of each
    loop in cases where it already exists.

    One loop remains: remove_dquot_ref(), because I'm not quite sure how
    to deal with that one w/o taking the i_lock.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro

    Eric Sandeen
     

09 Dec, 2019

2 commits


25 Sep, 2019

1 commit

  • In the previous patch, an application could put part of its text section in
    THP via madvise(). These THPs will be protected from writes when the
    application is still running (TXTBSY). However, after the application
    exits, the file is available for writes.

    This patch avoids writes to file THP by dropping page cache for the file
    when the file is open for write. A new counter nr_thps is added to struct
    address_space. In do_dentry_open(), if the file is open for write and
    nr_thps is non-zero, we drop page cache for the whole file.

    Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reported-by: kbuild test robot
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

30 Aug, 2019

1 commit

  • timespec_trunc() function is used to truncate a
    filesystem timestamp to the right granularity.
    But, the function does not clamp tv_sec part of the
    timestamps according to the filesystem timestamp limits.

    The replacement api: timestamp_truncate() also alters the
    signature of the function to accommodate filesystem
    timestamp clamping according to filesystem limits.

    Note that the tv_nsec part is set to 0 if tv_sec is not within
    the range supported for the filesystem.
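
    A minimal userspace sketch of clamping plus truncation, under the
    assumption that the filesystem limits are available as a small struct
    (model_limits and its fields are hypothetical names, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

struct model_ts {
	int64_t tv_sec;
	long tv_nsec;            /* 0 .. 999999999 */
};

struct model_limits {
	int64_t time_min;        /* smallest representable tv_sec */
	int64_t time_max;        /* largest representable tv_sec */
	unsigned int gran_ns;    /* timestamp granularity in ns, 1 .. 1000000000 */
};

static struct model_ts model_timestamp_truncate(struct model_ts t,
						const struct model_limits *fs)
{
	/* Out of range: clamp the seconds and drop the nanoseconds. */
	if (t.tv_sec < fs->time_min || t.tv_sec > fs->time_max) {
		t.tv_sec = t.tv_sec < fs->time_min ? fs->time_min : fs->time_max;
		t.tv_nsec = 0;
		return t;
	}

	/* In range: round the nanoseconds down to the granularity. */
	if (fs->gran_ns > 1)
		t.tv_nsec -= t.tv_nsec % (long)fs->gran_ns;
	return t;
}
```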

    Signed-off-by: Deepa Dinamani
    Acked-by: Jeff Layton

    Deepa Dinamani
     

13 Jul, 2019

1 commit

  • Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
    "Here's a patch series that sets up common parameter checking functions
    for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.

    The goal here is to reduce the amount of behavioral variance between
    the filesystems where those ioctls originated (ext2 and XFS,
    respectively) and everybody else.

    - Standardize parameter checking for the SETFLAGS and FSSETXATTR
    ioctls (which were the file attribute setters for ext4 and xfs and
    have now been hoisted to the vfs)

    - Only allow the DAX flag to be set on files and directories"

    * tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: only allow FSSETXATTR to set DAX flag on files and dirs
    vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
    vfs: teach vfs_ioc_fssetxattr_check to check project id info
    vfs: create a generic checking function for FS_IOC_FSSETXATTR
    vfs: create a generic checking and prep function for FS_IOC_SETFLAGS

    Linus Torvalds
     

11 Jul, 2019

1 commit

  • Pull copy_file_range updates from Darrick Wong:
    "This fixes numerous parameter checking problems and inconsistent
    behaviors in the new(ish) copy_file_range system call.

    Now the system call will actually check its range parameters
    correctly; refuse to copy into files for which the caller does not
    have sufficient privileges; update mtime and strip setuid like file
    writes are supposed to do; and allows copying up to the EOF of the
    source file instead of failing the call like we used to.

    Summary:

    - Create a generic copy_file_range handler and make individual
    filesystems responsible for calling it (i.e. no more assuming that
    do_splice_direct will work or is appropriate)

    - Refactor copy_file_range and remap_range parameter checking where
    they are the same

    - Install missing copy_file_range parameter checking(!)

    - Remove suid/sgid and update mtime like any other file write

    - Change the behavior so that a copy range crossing the source file's
    eof will result in a short copy to the source file's eof instead of
    EINVAL

    - Permit filesystems to decide if they want to handle
    cross-superblock copy_file_range in their local handlers"

    * tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    fuse: copy_file_range needs to strip setuid bits and update timestamps
    vfs: allow copy_file_range to copy across devices
    xfs: use file_modified() helper
    vfs: introduce file_modified() helper
    vfs: add missing checks to copy_file_range
    vfs: remove redundant checks from generic_remap_checks()
    vfs: introduce generic_file_rw_checks()
    vfs: no fallback for ->copy_file_range
    vfs: introduce generic_copy_file_range()

    Linus Torvalds
     

01 Jul, 2019

5 commits


10 Jun, 2019

1 commit

  • The combination of file_remove_privs() and file_update_mtime() is
    quite common in filesystem ->write_iter() methods.

    Modelled after the helper file_accessed(), introduce file_modified()
    and use it from generic_remap_file_range_prep().

    Note that the order of calling file_remove_privs() before
    file_update_mtime() in the helper was matched to the more common order by
    filesystems and not the current order in generic_remap_file_range_prep().
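
    The ordering can be illustrated with a toy model in which each step
    records that it ran. All names are hypothetical, and stopping at the
    first error is an assumption of this sketch:

```c
#include <assert.h>
#include <string.h>

struct model_file {
	int remove_privs_err;   /* error to inject, 0 for success */
	char trace[8];          /* call order: 'p' = privs removed, 't' = time updated */
};

static int model_remove_privs(struct model_file *f)
{
	strcat(f->trace, "p");
	return f->remove_privs_err;
}

static int model_update_mtime(struct model_file *f)
{
	strcat(f->trace, "t");
	return 0;
}

/* Combined helper: strip privileges first, then update the timestamp. */
static int model_file_modified(struct model_file *f)
{
	int err = model_remove_privs(f);

	if (err)
		return err;
	return model_update_mtime(f);
}
```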

    Signed-off-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Amir Goldstein
     

01 Jun, 2019

1 commit

  • Since a28334862993 ("page cache: Finish XArray conversion"), on most
    major Linux distributions, the page cache doesn't correctly transition
    when the hot data set is changing, and leaves the new pages thrashing
    indefinitely instead of kicking out the cold ones.

    On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
    running stock Arch Linux:

    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120086/153600 workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120268/153600 workingset-b

    workingset-b is a 600M file on a 1G host that is otherwise entirely
    idle. No matter how often it's being accessed, it won't get cached.

    While investigating, I noticed that the non-resident information gets
    aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
    a problem because a workingset transition like this relies on the
    non-resident information tracked in the page cache tree of evicted
    file ranges: when the cache faults are refaults of recently evicted
    cache, we challenge the existing active set, and that allows a new
    workingset to establish itself.

    Tracing the shrinker that maintains this memory revealed that all page
    cache tree nodes were allocated to the root cgroup. This is a problem,
    because 1) the shrinker sizes the amount of non-resident information
    it keeps to the size of the cgroup's other memory and 2) on most major
    Linux distributions, only kernel threads live in the root cgroup and
    everything else gets put into services or session groups:

    [root@ham ~]# cat /proc/self/cgroup
    0::/user.slice/user-0.slice/session-c1.scope

    As a result, we basically maintain no non-resident information for the
    workloads running on the system, thus breaking the caching algorithm.

    Looking through the code, I found the culprit in the above-mentioned
    patch: when switching from the radix tree to xarray, it dropped the
    __GFP_ACCOUNT flag from the tree node allocations - the flag that
    makes sure the allocated memory gets charged to and tracked by the
    cgroup of the calling process - in this case, the one doing the fault.

    To fix this, allow xarray users to specify per-tree flag that makes
    xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
    tree annotation to request such cgroup tracking for the cache nodes.

    With this patch applied, the page cache correctly converges on new
    workingsets again after just a few iterations:

    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    124607/153600 workingset-a
    87876/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    81313/153600 workingset-a
    133321/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    63036/153600 workingset-a
    153600/153600 workingset-b

    Cc: stable@vger.kernel.org # 4.20+
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Signed-off-by: Matthew Wilcox (Oracle)

    Johannes Weiner
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

2 commits

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, with no common topic whatsoever..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    libfs: document simple_get_link()
    Documentation/filesystems/Locking: fix ->get_link() prototype
    Documentation/filesystems/vfs.txt: document how ->i_link works
    Documentation/filesystems/vfs.txt: remove bogus "Last updated" date
    fs: use timespec64 in relatime_need_update
    fs/block_dev.c: remove unused include

    Linus Torvalds
     
  • Pull vfs inode freeing updates from Al Viro:
    "Introduction of separate method for RCU-delayed part of
    ->destroy_inode() (if any).

    Pretty much as posted, except that destroy_inode() stashes
    ->free_inode into the victim (anon-unioned with ->i_fops) before
    scheduling i_callback() and the last two patches (sockfs conversion
    and folding struct socket_wq into struct socket) are excluded - that
    pair should go through netdev once davem reopens his tree"

    * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
    orangefs: make use of ->free_inode()
    shmem: make use of ->free_inode()
    hugetlb: make use of ->free_inode()
    overlayfs: make use of ->free_inode()
    jfs: switch to ->free_inode()
    fuse: switch to ->free_inode()
    ext4: make use of ->free_inode()
    ecryptfs: make use of ->free_inode()
    ceph: use ->free_inode()
    btrfs: use ->free_inode()
    afs: switch to use of ->free_inode()
    dax: make use of ->free_inode()
    ntfs: switch to ->free_inode()
    securityfs: switch to ->free_inode()
    apparmor: switch to ->free_inode()
    rpcpipe: switch to ->free_inode()
    bpf: switch to ->free_inode()
    mqueue: switch to ->free_inode()
    ufs: switch to ->free_inode()
    coda: switch to ->free_inode()
    ...

    Linus Torvalds
     

02 May, 2019

1 commit

  • A lot of ->destroy_inode() instances end with call_rcu() of a callback
    that does RCU-delayed part of freeing. Introduce a new method for
    doing just that, with saner signature.

    Rules:
    ->destroy_inode   ->free_inode
    f                 g              immediate call of f(),
                                     RCU-delayed call of g()
    f                 NULL           immediate call of f(),
                                     no RCU-delayed calls
    NULL              g              RCU-delayed call of g()
    NULL              NULL           RCU-delayed default freeing

    IOW, NULL ->free_inode gives the same behaviour as now.

    Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
    mandate the latter form, but that would have very little benefit beyond
    making rules a bit more symmetric. It would break backwards compatibility,
    require extra boilerplate and expected semantics for (NULL, NULL) pair
    would have no use whatsoever...
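
    The four rules can be expressed as a small dispatch function. This is a
    userspace model with hypothetical names: instead of scheduling an RCU
    callback it simply returns the function that would run after the grace
    period, or NULL when nothing is delayed.

```c
#include <assert.h>
#include <stddef.h>

struct model_inode {
	int destroyed_now;     /* set by the immediate callback f() */
	int freed_after_rcu;   /* set by the delayed callback g() */
	int default_freed;     /* set by the default delayed freeing */
};

typedef void (*inode_fn)(struct model_inode *);

struct model_super_ops {
	inode_fn destroy_inode;   /* f: runs immediately */
	inode_fn free_inode;      /* g: runs after an RCU grace period */
};

static void model_default_free(struct model_inode *inode)
{
	inode->default_freed = 1;
}

/* Runs f() right away if present and returns what must run after the grace period. */
static inode_fn model_destroy_inode(const struct model_super_ops *ops,
				    struct model_inode *inode)
{
	if (ops->destroy_inode) {
		ops->destroy_inode(inode);
		return ops->free_inode;   /* NULL here means no RCU-delayed call */
	}
	return ops->free_inode ? ops->free_inode : model_default_free;
}

/* Example callbacks standing in for a filesystem's f() and g(). */
static void demo_f(struct model_inode *inode) { inode->destroyed_now = 1; }
static void demo_g(struct model_inode *inode) { inode->freed_after_rcu = 1; }
```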

    Signed-off-by: Al Viro

    Al Viro
     

29 Apr, 2019

1 commit

  • file_remove_privs() might be called for non-regular files, e.g.
    blkdev inode. There is no reason to do its job on things
    like blkdev inodes, pipes, or cdevs. Hence, abort if
    file does not refer to a regular inode.

    AV: more to the point, for devices there might be any number of
    inodes referring to given device. Which one to strip the permissions
    from, even if that made any sense in the first place? All of them
    will be observed with contents modified, after all.
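
    The early exit amounts to a file-type check in front of the usual
    setuid/setgid test. Below is a userspace sketch using the POSIX mode
    macros; the function name is hypothetical, and the exact setgid
    condition (only when group-execute is also set) is an assumption about
    the surrounding logic, not something stated in this message.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Returns true when privilege bits would need stripping on write. */
static bool model_needs_priv_removal(mode_t mode)
{
	/* Block devices, character devices, pipes and so on: nothing to do. */
	if (!S_ISREG(mode))
		return false;

	if (mode & S_ISUID)
		return true;
	/* Assumed rule: setgid only counts together with group-execute. */
	return (mode & S_ISGID) && (mode & S_IXGRP);
}
```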

    Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf
    Spinczyk)

    Reviewed-by: Jan Kara
    Signed-off-by: Alexander Lochmann
    Signed-off-by: Horst Schirmeier
    Signed-off-by: Al Viro

    Alexander Lochmann
     

26 Apr, 2019

1 commit

  • For some reason, the conversion of the VFS code away from 'struct timespec'
    left one function behind that still uses it, for absolutely no reason.

    Using timespec64 will make the atime update logic work correctly past
    y2038.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Arnd Bergmann
     

06 Mar, 2019

1 commit

  • It seems that commits 5f16f3225b0624 and 00a1a053ebe5, both with the same
    commit log ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()"),
    introduced the set_mask_bits API but somehow ended up not using it in ext4.

    Also, set_mask_bits() is used in fs quite a bit and we can possibly come
    up with a generic llsc based implementation (w/o the cmpxchg loop)
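
    For readers unfamiliar with the cmpxchg loop mentioned above, here is a
    userspace analogue written with C11 atomics. It is a sketch of the
    technique under a hypothetical name, not the kernel macro itself.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically clear `mask` and set `bits` in *word; returns the previous value. */
static unsigned int model_set_mask_bits(_Atomic unsigned int *word,
					unsigned int mask, unsigned int bits)
{
	unsigned int old = atomic_load(word);
	unsigned int updated;

	do {
		updated = (old & ~mask) | bits;
		/* On failure, `old` is refreshed with the current value and we retry. */
	} while (!atomic_compare_exchange_weak(word, &old, updated));

	return old;
}
```

    An LL/SC based implementation would replace the retry loop with the
    architecture's load-linked/store-conditional pair, which is the
    direction the message hints at.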

    Link: http://lkml.kernel.org/r/1548275584-18096-3-git-send-email-vgupta@synopsys.com
    Signed-off-by: Vineet Gupta
    Reviewed-by: Anthony Yznaga
    Cc: Alexander Viro
    Cc: Theodore Ts'o
    Cc: Peter Zijlstra (Intel)
    Cc: Chris Wilson
    Cc: Ingo Molnar
    Cc: Jani Nikula
    Cc: Miklos Szeredi
    Cc: Oleg Nesterov
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

13 Feb, 2019

1 commit

  • This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
    attached pages").

    This change causes serious changes to page cache and inode cache
    behaviour and balance, resulting in major performance regressions when
    combining workloads such as large file copies and kernel compiles.

    https://bugzilla.kernel.org/show_bug.cgi?id=202441

    This change is a hack to work around the problems introduced by changing
    how aggressive shrinkers are on small caches in commit 172b06c32b94 ("mm:
    slowly shrink slabs with a relatively small number of objects"). It
    creates more problems than it solves, wasn't adequately reviewed or
    tested, so it needs to be reverted.

    Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
    Fixes: a76cf1a474d7d ("mm: don't reclaim inodes with many attached pages")
    Signed-off-by: Dave Chinner
    Cc: Wolfgang Walter
    Cc: Roman Gushchin
    Cc: Spock
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

29 Dec, 2018

1 commit

  • Pull y2038 updates from Arnd Bergmann:
    "More syscalls and cleanups

    This concludes the main part of the system call rework for 64-bit
    time_t, which has spread over most of year 2018, the last six system
    calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

    As before, nothing changes for 64-bit architectures, while 32-bit
    architectures gain another entry point that differs only in the layout
    of the timespec structure. Hopefully in the next release we can wire
    up all 22 of those system calls on all 32-bit architectures, which
    gives us a baseline version for glibc to start using them.

    This does not include the clock_adjtime, getrusage/waitid, and
    getitimer/setitimer system calls. I still plan to have new versions of
    those as well, but they are not required for correct operation of the
    C library since they can be emulated using the old 32-bit time_t based
    system calls.

    Aside from the system calls, there are also a few cleanups here,
    removing old kernel internal interfaces that have become unused after
    all references got removed. The arch/sh cleanups are part of this;
    these were posted several times over the past year without a reaction
    from the maintainers, while the corresponding changes made it into all
    other architectures"

    * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
    timekeeping: remove obsolete time accessors
    vfs: replace current_kernel_time64 with ktime equivalent
    timekeeping: remove timespec_add/timespec_del
    timekeeping: remove unused {read,update}_persistent_clock
    sh: remove board_time_init() callback
    sh: remove unused rtc_sh_get/set_time infrastructure
    sh: sh03: rtc: push down rtc class ops into driver
    sh: dreamcast: rtc: push down rtc class ops into driver
    y2038: signal: Add compat_sys_rt_sigtimedwait_time64
    y2038: signal: Add sys_rt_sigtimedwait_time32
    y2038: socket: Add compat_sys_recvmmsg_time64
    y2038: futex: Add support for __kernel_timespec
    y2038: futex: Move compat implementation into futex.c
    io_pgetevents: use __kernel_timespec
    pselect6: use __kernel_timespec
    ppoll: use __kernel_timespec
    signal: Add restore_user_sigmask()
    signal: Add set_user_sigmask()

    Linus Torvalds
     

18 Dec, 2018

1 commit

  • current_time is the last remaining caller of current_kernel_time64(),
    which is a wrapper around ktime_get_coarse_real_ts64(). This calls the
    latter directly for consistency with the rest of the kernel that is moving
    to the ktime_get_ family of time accessors, as now documented in
    Documentation/core-api/timekeeping.rst.

    An open question is whether we may want to actually call the more
    accurate ktime_get_real_ts64() for file systems that save high-resolution
    timestamps in their on-disk format. This would add a small overhead to
    each update of the inode stamps but would give inode timestamps a
    usable resolution better than one jiffy (1 to 10 milliseconds
    normally). Experiments on a variety of hardware platforms show a typical
    time of around 100 CPU cycles to read the cycle counter and calculate the
    accurate time from that. On old platforms without a cycle counter, this
    can be significantly higher, up to several microseconds to access a
    hardware clock, but those have become very rare by now.

    I traced the original addition of the current_kernel_time() call to set
    the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch
    with subject "nanosecond stat timefields". Andi explains that the
    motivation was to introduce as little overhead as possible back then. At
    this time, reading the clock hardware was also more expensive when most
    architectures did not have a cycle counter.

    One side effect of having more accurate inode timestamps would be having to
    write out the inode every time that mtime/ctime/atime get touched on most
    systems, whereas many file systems today only write it when the timestamps
    have changed, i.e. at most once per jiffy unless something else changes
    as well. That change would certainly be noticed in some workloads, which
    is enough reason to not do it without a good reason, regardless of the
    cost of reading the time.

    One thing we could still consider however would be to round the timestamps
    from current_time() to multiples of NSEC_PER_JIFFY, e.g. full
    milliseconds rather than having six or seven meaningless but confusing
    digits at the end of the timestamp.
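
    The rounding idea in the previous paragraph could look roughly like
    the following userspace sketch. It is not what the patch does; the
    helper name round_down_to_jiffy(), the explicit hz parameter, and the
    choice to round down rather than to nearest are all assumptions made
    for illustration.

```c
#include <time.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/*
 * Truncate the nanosecond field to a multiple of one jiffy, where a
 * jiffy lasts 1/hz seconds. 'hz' stands in for the kernel tick rate.
 */
static struct timespec round_down_to_jiffy(struct timespec ts, long hz)
{
    long nsec_per_jiffy = NSEC_PER_SEC_SKETCH / hz;

    ts.tv_nsec -= ts.tv_nsec % nsec_per_jiffy;
    return ts;
}
```

    With hz = 1000, a nanosecond field of 123456789 becomes 123000000,
    i.e. full milliseconds; with hz = 250 it becomes 120000000.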

    Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

19 Nov, 2018

1 commit

  • Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a
    relatively small number of objects") leads to a regression on his setup:
    periodically the majority of the pagecache is evicted without an obvious
    reason, while before the change the amount of free memory was balancing
    around the watermark.

    The reason is that the above-mentioned change created some minimal
    background pressure on the inode cache. The problem is that if an
    inode is selected for reclaim, all of its pagecache pages are
    stripped, no matter how many of them there are. So, if a huge
    multi-gigabyte file is cached in memory, and the goal is to reclaim
    only a few slab objects (unused inodes), we can still end up evicting
    all gigabytes of the pagecache at once.

    The workload described by Spock has a few large non-mapped files in the
    pagecache, so it's especially noticeable.

    To solve the problem let's postpone the reclaim of inodes that have
    more than 1 attached page. Let's wait until the pagecache pages are
    evicted naturally by scanning the corresponding LRU lists, and only then
    reclaim the inode structure.
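
    The decision described above boils down to a single threshold check.
    The sketch below models it with hypothetical types (inode_sketch,
    lru_decision, lru_decide() are illustrative names, not kernel
    symbols). Note that a later entry in this log reverts the change.

```c
/* Hypothetical stand-in for the per-inode state the shrinker looks at. */
struct inode_sketch {
    unsigned long nrpages;   /* pagecache pages attached to the inode */
};

enum lru_decision {
    LRU_SKETCH_RECLAIM,      /* free the inode now */
    LRU_SKETCH_POSTPONE      /* leave it until its pages are gone */
};

/* Postpone reclaim while more than one pagecache page is attached. */
static enum lru_decision lru_decide(const struct inode_sketch *inode)
{
    if (inode->nrpages > 1)
        return LRU_SKETCH_POSTPONE;
    return LRU_SKETCH_RECLAIM;
}
```

    Under this rule an inode backing a multi-gigabyte cached file is
    skipped, while an inode with zero or one page remains reclaimable.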

    Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reported-by: Spock
    Tested-by: Spock
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Randy Dunlap
    Cc: [4.19.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>'

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

21 Oct, 2018

1 commit


22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds