12 Jan, 2020

1 commit

  • [ Upstream commit 04646aebd30b99f2cfa0182435a2ec252fcb16d0 ]

    Anything that walks all inodes on sb->s_inodes list without rescheduling
    risks softlockups.

    Previous efforts were made in 2 functions, see:

    c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
    ac05fbb inode: don't softlockup when evicting inodes

    but there hasn't been an audit of all walkers, so do that now. This
    also consistently moves the cond_resched() calls to the bottom of each
    loop in cases where it already exists.

    One loop remains: remove_dquot_ref(), because I'm not quite sure how
    to deal with that one w/o taking the i_lock.
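
    As a rough illustration of the pattern described above, here is a
    userspace sketch of a list walker that yields at the bottom of every
    iteration. All names (model_inode, model_cond_resched, model_walk_inodes)
    are hypothetical stand-ins, sched_yield() merely plays the role of
    cond_resched(), and the locking and reference counting a real walker
    needs around the yield are left out.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode linked on a per-superblock list. */
struct model_inode {
    struct model_inode *next;
    int dirty;
};

/* Counts how often the walker offered to give up the CPU. */
static unsigned long model_resched_calls;

/* Userspace stand-in for cond_resched(): just yield the CPU. */
static void model_cond_resched(void)
{
    model_resched_calls++;
    sched_yield();
}

/* Visit every inode; the yield sits at the bottom of the loop body. */
static size_t model_walk_inodes(struct model_inode *head)
{
    size_t visited = 0;

    for (struct model_inode *inode = head; inode; inode = inode->next) {
        inode->dirty = 0;       /* the per-inode work */
        visited++;
        model_cond_resched();   /* bottom of the loop, once per inode */
    }
    return visited;
}
```

    With the yield inside the loop, the number of yields grows with the
    length of the list, so a very long list cannot monopolize the CPU.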

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Eric Sandeen
     

25 Sep, 2019

1 commit

  • In the previous patch, an application could put part of its text section in
    THP via madvise(). These THPs will be protected from writes when the
    application is still running (TXTBSY). However, after the application
    exits, the file is available for writes.

    This patch avoids writes to file THP by dropping page cache for the file
    when the file is open for write. A new counter nr_thps is added to struct
    address_space. In do_dentry_open(), if the file is open for write and
    nr_thps is non-zero, we drop page cache for the whole file.
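
    The open-time check described above can be modelled in userspace roughly
    as follows. This is only a sketch: model_mapping, model_open and
    model_truncate_pagecache are hypothetical names, and the real kernel
    structures and helpers are more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-file page cache bookkeeping. */
struct model_mapping {
    unsigned long nrpages;  /* pages currently cached for the file */
    int nr_thps;            /* number of cached huge pages */
};

/* Drop everything cached for the file. */
static void model_truncate_pagecache(struct model_mapping *m)
{
    m->nrpages = 0;
    m->nr_thps = 0;
}

/*
 * Model of the open path: opening for write while huge pages are
 * cached drops the page cache for the whole file.
 * Returns true if the cache was dropped.
 */
static bool model_open(struct model_mapping *m, bool for_write)
{
    if (for_write && m->nr_thps > 0) {
        model_truncate_pagecache(m);
        return true;
    }
    return false;
}
```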

    Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reported-by: kbuild test robot
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

30 Aug, 2019

1 commit

  • timespec_trunc() function is used to truncate a
    filesystem timestamp to the right granularity.
    But, the function does not clamp tv_sec part of the
    timestamps according to the filesystem timestamp limits.

    The replacement api: timestamp_truncate() also alters the
    signature of the function to accommodate filesystem
    timestamp clamping according to filesystem limits.

    Note that the tv_nsec part is set to 0 if tv_sec is not within
    the range supported for the filesystem.
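
    A minimal userspace sketch of the clamping and truncation behaviour
    described above. The types and names (model_ts, model_sb_limits,
    model_timestamp_truncate) are assumptions made for illustration and do
    not reproduce the kernel signature.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000L

/* Hypothetical timestamp and per-filesystem limits. */
struct model_ts {
    int64_t tv_sec;
    long    tv_nsec;
};

struct model_sb_limits {
    int64_t time_min;   /* smallest representable tv_sec */
    int64_t time_max;   /* largest representable tv_sec */
    long    time_gran;  /* timestamp granularity in nanoseconds */
};

static struct model_ts
model_timestamp_truncate(struct model_ts t, const struct model_sb_limits *sb)
{
    /* Clamp tv_sec; an out-of-range value also zeroes tv_nsec. */
    if (t.tv_sec < sb->time_min) {
        t.tv_sec = sb->time_min;
        t.tv_nsec = 0;
    } else if (t.tv_sec > sb->time_max) {
        t.tv_sec = sb->time_max;
        t.tv_nsec = 0;
    }

    /* Truncate tv_nsec down to the filesystem granularity. */
    if (sb->time_gran == MODEL_NSEC_PER_SEC)
        t.tv_nsec = 0;
    else if (sb->time_gran > 1 && sb->time_gran < MODEL_NSEC_PER_SEC)
        t.tv_nsec -= t.tv_nsec % sb->time_gran;

    return t;
}
```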

    Signed-off-by: Deepa Dinamani
    Acked-by: Jeff Layton

    Deepa Dinamani
     

13 Jul, 2019

1 commit

  • Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
    "Here's a patch series that sets up common parameter checking functions
    for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.

    The goal here is to reduce the amount of behavioral variance between
    the filesystems where those ioctls originated (ext2 and XFS,
    respectively) and everybody else.

    - Standardize parameter checking for the SETFLAGS and FSSETXATTR
    ioctls (which were the file attribute setters for ext4 and xfs and
    have now been hoisted to the vfs)

    - Only allow the DAX flag to be set on files and directories"

    * tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: only allow FSSETXATTR to set DAX flag on files and dirs
    vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
    vfs: teach vfs_ioc_fssetxattr_check to check project id info
    vfs: create a generic checking function for FS_IOC_FSSETXATTR
    vfs: create a generic checking and prep function for FS_IOC_SETFLAGS

    Linus Torvalds
     

11 Jul, 2019

1 commit

  • Pull copy_file_range updates from Darrick Wong:
    "This fixes numerous parameter checking problems and inconsistent
    behaviors in the new(ish) copy_file_range system call.

    Now the system call will actually check its range parameters
    correctly; refuse to copy into files for which the caller does not
    have sufficient privileges; update mtime and strip setuid like file
    writes are supposed to do; and allows copying up to the EOF of the
    source file instead of failing the call like we used to.

    Summary:

    - Create a generic copy_file_range handler and make individual
    filesystems responsible for calling it (i.e. no more assuming that
    do_splice_direct will work or is appropriate)

    - Refactor copy_file_range and remap_range parameter checking where
    they are the same

    - Install missing copy_file_range parameter checking(!)

    - Remove suid/sgid and update mtime like any other file write

    - Change the behavior so that a copy range crossing the source file's
    eof will result in a short copy to the source file's eof instead of
    EINVAL

    - Permit filesystems to decide if they want to handle
    cross-superblock copy_file_range in their local handlers"
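
    The "short copy to the source file's eof" rule from the summary above can
    be sketched as a small length computation. model_copy_len is a
    hypothetical helper written for this illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the EOF rule: a range that starts at or past EOF copies
 * nothing, and a range that crosses EOF is shortened to end at EOF
 * instead of being rejected.
 */
static uint64_t model_copy_len(int64_t pos_in, uint64_t count, int64_t size_in)
{
    uint64_t remaining;

    if (pos_in >= size_in)
        return 0;

    remaining = (uint64_t)(size_in - pos_in);
    return count < remaining ? count : remaining;
}
```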

    * tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    fuse: copy_file_range needs to strip setuid bits and update timestamps
    vfs: allow copy_file_range to copy across devices
    xfs: use file_modified() helper
    vfs: introduce file_modified() helper
    vfs: add missing checks to copy_file_range
    vfs: remove redundant checks from generic_remap_checks()
    vfs: introduce generic_file_rw_checks()
    vfs: no fallback for ->copy_file_range
    vfs: introduce generic_copy_file_range()

    Linus Torvalds
     

01 Jul, 2019

5 commits


10 Jun, 2019

1 commit

  • The combination of file_remove_privs() and file_update_mtime() is
    quite common in filesystem ->write_iter() methods.

    Modelled after the helper file_accessed(), introduce file_modified()
    and use it from generic_remap_file_range_prep().

    Note that the order of calling file_remove_privs() before
    file_update_mtime() in the helper was matched to the more common order by
    filesystems and not the current order in generic_remap_file_range_prep().

    Signed-off-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Amir Goldstein
     

01 Jun, 2019

1 commit

  • Since a28334862993 ("page cache: Finish XArray conversion"), on most
    major Linux distributions, the page cache doesn't correctly transition
    when the hot data set is changing, and leaves the new pages thrashing
    indefinitely instead of kicking out the cold ones.

    On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
    running stock Arch Linux:

    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120086/153600 workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    104029/153600 workingset-a
    120268/153600 workingset-b

    workingset-b is a 600M file on a 1G host that is otherwise entirely
    idle. No matter how often it's being accessed, it won't get cached.

    While investigating, I noticed that the non-resident information gets
    aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
    a problem because a workingset transition like this relies on the
    non-resident information tracked in the page cache tree of evicted
    file ranges: when the cache faults are refaults of recently evicted
    cache, we challenge the existing active set, and that allows a new
    workingset to establish itself.

    Tracing the shrinker that maintains this memory revealed that all page
    cache tree nodes were allocated to the root cgroup. This is a problem,
    because 1) the shrinker sizes the amount of non-resident information
    it keeps to the size of the cgroup's other memory and 2) on most major
    Linux distributions, only kernel threads live in the root cgroup and
    everything else gets put into services or session groups:

    [root@ham ~]# cat /proc/self/cgroup
    0::/user.slice/user-0.slice/session-c1.scope

    As a result, we basically maintain no non-resident information for the
    workloads running on the system, thus breaking the caching algorithm.

    Looking through the code, I found the culprit in the above-mentioned
    patch: when switching from the radix tree to xarray, it dropped the
    __GFP_ACCOUNT flag from the tree node allocations - the flag that
    makes sure the allocated memory gets charged to and tracked by the
    cgroup of the calling process - in this case, the one doing the fault.

    To fix this, allow xarray users to specify a per-tree flag that makes
    xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
    tree annotation to request such cgroup tracking for the cache nodes.
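
    The idea of a per-tree flag that turns on accounted node allocations can
    be sketched as below. The flag values and names (MODEL_GFP_KERNEL,
    MODEL_GFP_ACCOUNT, MODEL_XA_FLAGS_ACCOUNT, model_node_gfp) are invented
    for this illustration and are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical allocation and tree flags for the model. */
#define MODEL_GFP_KERNEL        0x1u
#define MODEL_GFP_ACCOUNT       0x2u
#define MODEL_XA_FLAGS_ACCOUNT  0x4u

struct model_xarray {
    unsigned int xa_flags;  /* per-tree flags chosen by the user */
};

/*
 * Compute the allocation flags for a tree node: trees that opted in
 * get the accounting bit added, all others are left untouched.
 */
static unsigned int model_node_gfp(const struct model_xarray *xa,
                                   unsigned int gfp)
{
    if (xa->xa_flags & MODEL_XA_FLAGS_ACCOUNT)
        gfp |= MODEL_GFP_ACCOUNT;
    return gfp;
}
```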

    With this patch applied, the page cache correctly converges on new
    workingsets again after just a few iterations:

    [root@ham ~]# ./reclaimtest.sh
    + dd of=workingset-a bs=1M count=0 seek=600
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + cat workingset-a
    + ./mincore workingset-a
    153600/153600 workingset-a
    + dd of=workingset-b bs=1M count=0 seek=600
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    124607/153600 workingset-a
    87876/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    81313/153600 workingset-a
    133321/153600 workingset-b
    + cat workingset-b
    + ./mincore workingset-a workingset-b
    63036/153600 workingset-a
    153600/153600 workingset-b

    Cc: stable@vger.kernel.org # 4.20+
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Signed-off-by: Matthew Wilcox (Oracle)

    Johannes Weiner
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

2 commits

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, with no common topic whatsoever..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    libfs: document simple_get_link()
    Documentation/filesystems/Locking: fix ->get_link() prototype
    Documentation/filesystems/vfs.txt: document how ->i_link works
    Documentation/filesystems/vfs.txt: remove bogus "Last updated" date
    fs: use timespec64 in relatime_need_update
    fs/block_dev.c: remove unused include

    Linus Torvalds
     
  • Pull vfs inode freeing updates from Al Viro:
    "Introduction of separate method for RCU-delayed part of
    ->destroy_inode() (if any).

    Pretty much as posted, except that destroy_inode() stashes
    ->free_inode into the victim (anon-unioned with ->i_fops) before
    scheduling i_callback() and the last two patches (sockfs conversion
    and folding struct socket_wq into struct socket) are excluded - that
    pair should go through netdev once davem reopens his tree"

    * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
    orangefs: make use of ->free_inode()
    shmem: make use of ->free_inode()
    hugetlb: make use of ->free_inode()
    overlayfs: make use of ->free_inode()
    jfs: switch to ->free_inode()
    fuse: switch to ->free_inode()
    ext4: make use of ->free_inode()
    ecryptfs: make use of ->free_inode()
    ceph: use ->free_inode()
    btrfs: use ->free_inode()
    afs: switch to use of ->free_inode()
    dax: make use of ->free_inode()
    ntfs: switch to ->free_inode()
    securityfs: switch to ->free_inode()
    apparmor: switch to ->free_inode()
    rpcpipe: switch to ->free_inode()
    bpf: switch to ->free_inode()
    mqueue: switch to ->free_inode()
    ufs: switch to ->free_inode()
    coda: switch to ->free_inode()
    ...

    Linus Torvalds
     

02 May, 2019

1 commit

  • A lot of ->destroy_inode() instances end with call_rcu() of a callback
    that does RCU-delayed part of freeing. Introduce a new method for
    doing just that, with saner signature.

    Rules:
    ->destroy_inode  ->free_inode
          f               g         immediate call of f(),
                                    RCU-delayed call of g()
          f               NULL      immediate call of f(),
                                    no RCU-delayed calls
          NULL            g         RCU-delayed call of g()
          NULL            NULL      RCU-delayed default freeing

    IOW, NULL ->free_inode gives the same behaviour as now.

    Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
    mandate the latter form, but that would have very little benefit beyond
    making rules a bit more symmetric. It would break backwards compatibility,
    require extra boilerplate and expected semantics for (NULL, NULL) pair
    would have no use whatsoever...
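
    The four rules can be restated as a small decision function. This is a
    userspace sketch that only records what would happen; it calls nothing
    and performs no RCU work. All names (model_ops, model_outcome,
    model_destroy_inode, model_noop) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_inode;
typedef void (*model_inode_fn)(struct model_inode *);

/* Hypothetical subset of the superblock operations. */
struct model_ops {
    model_inode_fn destroy_inode;
    model_inode_fn free_inode;
};

/* What the model decides to do for a given pair of methods. */
struct model_outcome {
    bool immediate_destroy;  /* ->destroy_inode called right away */
    bool rcu_free_inode;     /* ->free_inode called after a grace period */
    bool rcu_default_free;   /* default freeing after a grace period */
};

/* Placeholder method so the tables below can be non-NULL. */
static void model_noop(struct model_inode *inode)
{
    (void)inode;
}

static struct model_outcome model_destroy_inode(const struct model_ops *ops)
{
    struct model_outcome out = { false, false, false };

    if (ops->destroy_inode) {
        out.immediate_destroy = true;
        if (!ops->free_inode)
            return out;          /* f, NULL: nothing RCU-delayed */
    }
    if (ops->free_inode)
        out.rcu_free_inode = true;    /* f, g or NULL, g */
    else
        out.rcu_default_free = true;  /* NULL, NULL */
    return out;
}
```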

    Signed-off-by: Al Viro

    Al Viro
     

29 Apr, 2019

1 commit

  • file_remove_privs() might be called for non-regular files, e.g.
    blkdev inode. There is no reason to do its job on things
    like blkdev inodes, pipes, or cdevs. Hence, abort if
    file does not refer to a regular inode.

    AV: more to the point, for devices there might be any number of
    inodes referring to given device. Which one to strip the permissions
    from, even if that made any sense in the first place? All of them
    will be observed with contents modified, after all.

    Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf
    Spinczyk)

    Reviewed-by: Jan Kara
    Signed-off-by: Alexander Lochmann
    Signed-off-by: Horst Schirmeier
    Signed-off-by: Al Viro

    Alexander Lochmann
     

26 Apr, 2019

1 commit

  • For some reason, the conversion of the VFS code away from 'struct timespec'
    left one function behind that still uses it, for absolutely no reason.

    Using timespec64 will make the atime update logic work correctly past
    y2038.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Arnd Bergmann
     

06 Mar, 2019

1 commit

  • It seems that commits 5f16f3225b0624 and 00a1a053ebe5, both with the same
    commit log ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()"),
    introduced the set_mask_bits API but in the end did not use it in ext4.

    Also, set_mask_bits() is used in fs quite a bit and we can possibly come
    up with a generic llsc based implementation (w/o the cmpxchg loop)
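
    For readers unfamiliar with the cmpxchg-loop shape mentioned above, here
    is a userspace sketch using C11 atomics. model_set_mask_bits is a
    hypothetical function written for this note, and returning the old value
    is a choice made here for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Atomically clear the bits in 'mask' and set the bits in 'bits',
 * retrying whenever another thread changed the word in between.
 * Returns the value observed before the update.
 */
static unsigned int model_set_mask_bits(_Atomic unsigned int *ptr,
                                        unsigned int mask,
                                        unsigned int bits)
{
    unsigned int old = atomic_load(ptr);
    unsigned int new_val;

    do {
        new_val = (old & ~mask) | bits;
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(ptr, &old, new_val));

    return old;
}
```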

    Link: http://lkml.kernel.org/r/1548275584-18096-3-git-send-email-vgupta@synopsys.com
    Signed-off-by: Vineet Gupta
    Reviewed-by: Anthony Yznaga
    Cc: Alexander Viro
    Cc: Theodore Ts'o
    Cc: Peter Zijlstra (Intel)
    Cc: Chris Wilson
    Cc: Ingo Molnar
    Cc: Jani Nikula
    Cc: Miklos Szeredi
    Cc: Oleg Nesterov
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

13 Feb, 2019

1 commit

  • This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
    attached pages").

    This change causes serious changes to page cache and inode cache
    behaviour and balance, resulting in major performance regressions when
    combining workloads such as large file copies and kernel compiles.

    https://bugzilla.kernel.org/show_bug.cgi?id=202441

    This change is a hack to work around the problems introduced by changing
    how aggressive shrinkers are on small caches in commit 172b06c32b94 ("mm:
    slowly shrink slabs with a relatively small number of objects"). It
    creates more problems than it solves, wasn't adequately reviewed or
    tested, so it needs to be reverted.

    Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
    Fixes: a76cf1a474d7d ("mm: don't reclaim inodes with many attached pages")
    Signed-off-by: Dave Chinner
    Cc: Wolfgang Walter
    Cc: Roman Gushchin
    Cc: Spock
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

29 Dec, 2018

1 commit

  • Pull y2038 updates from Arnd Bergmann:
    "More syscalls and cleanups

    This concludes the main part of the system call rework for 64-bit
    time_t, which has spread over most of year 2018, the last six system
    calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

    As before, nothing changes for 64-bit architectures, while 32-bit
    architectures gain another entry point that differs only in the layout
    of the timespec structure. Hopefully in the next release we can wire
    up all 22 of those system calls on all 32-bit architectures, which
    gives us a baseline version for glibc to start using them.

    This does not include the clock_adjtime, getrusage/waitid, and
    getitimer/setitimer system calls. I still plan to have new versions of
    those as well, but they are not required for correct operation of the
    C library since they can be emulated using the old 32-bit time_t based
    system calls.

    Aside from the system calls, there are also a few cleanups here,
    removing old kernel internal interfaces that have become unused after
    all references got removed. The arch/sh cleanups are part of this,
    these were posted several times over the past year without a reaction
    from the maintainers, while the corresponding changes made it into all
    other architectures"

    * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
    timekeeping: remove obsolete time accessors
    vfs: replace current_kernel_time64 with ktime equivalent
    timekeeping: remove timespec_add/timespec_del
    timekeeping: remove unused {read,update}_persistent_clock
    sh: remove board_time_init() callback
    sh: remove unused rtc_sh_get/set_time infrastructure
    sh: sh03: rtc: push down rtc class ops into driver
    sh: dreamcast: rtc: push down rtc class ops into driver
    y2038: signal: Add compat_sys_rt_sigtimedwait_time64
    y2038: signal: Add sys_rt_sigtimedwait_time32
    y2038: socket: Add compat_sys_recvmmsg_time64
    y2038: futex: Add support for __kernel_timespec
    y2038: futex: Move compat implementation into futex.c
    io_pgetevents: use __kernel_timespec
    pselect6: use __kernel_timespec
    ppoll: use __kernel_timespec
    signal: Add restore_user_sigmask()
    signal: Add set_user_sigmask()

    Linus Torvalds
     

18 Dec, 2018

1 commit

  • current_time is the last remaining caller of current_kernel_time64(),
    which is a wrapper around ktime_get_coarse_real_ts64(). This calls the
    latter directly for consistency with the rest of the kernel that is moving
    to the ktime_get_ family of time accessors, as now documented in
    Documentation/core-api/timekeeping.rst.

    An open question is whether we may want to actually call the more
    accurate ktime_get_real_ts64() for file systems that save high-resolution
    timestamps in their on-disk format. This would add a small overhead to
    each update of the inode stamps but would give inode timestamps a
    usable resolution better than one jiffy (1 to 10 milliseconds
    normally). Experiments on a variety of hardware platforms show a typical
    time of around 100 CPU cycles to read the cycle counter and calculate the
    accurate time from that. On old platforms without a cycle counter, this
    can be significantly higher, up to several microseconds to access a
    hardware clock, but those have become very rare by now.

    I traced the original addition of the current_kernel_time() call to set
    the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch
    with subject "nanosecond stat timefields". Andi explains that the
    motivation was to introduce as little overhead as possible back then. At
    this time, reading the clock hardware was also more expensive when most
    architectures did not have a cycle counter.

    One side effect of having more accurate inode timestamp would be having to
    write out the inode every time that mtime/ctime/atime get touched on most
    systems, whereas many file systems today only write it when the timestamps
    have changed, i.e. at most once per jiffy unless something else changes
    as well. That change would certainly be noticed in some workloads, which
    is enough reason to not do it without a good reason, regardless of the
    cost of reading the time.

    One thing we could still consider however would be to round the timestamps
    from current_time() to multiples of NSEC_PER_JIFFY, e.g. full
    milliseconds rather than having six or seven meaningless but confusing
    digits at the end of the timestamp.

    Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

19 Nov, 2018

1 commit

  • Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a
    relatively small number of objects") leads to a regression on his setup:
    periodically the majority of the pagecache is evicted without an obvious
    reason, while before the change the amount of free memory was balancing
    around the watermark.

    The reason is that the above-mentioned change created some minimal
    background pressure on the inode cache. The problem is that if an
    inode is considered for reclaim, all pagecache pages belonging to it
    are stripped, no matter how many of them there are. So, if a huge
    multi-gigabyte file is cached in memory, and the goal is to reclaim
    only a few slab objects (unused inodes), we can still end up evicting
    all gigabytes of the pagecache at once.

    The workload described by Spock has a few large non-mapped files in the
    pagecache, so it's especially noticeable.

    To solve the problem, let's postpone the reclaim of inodes that have
    more than 1 attached page. Let's wait until the pagecache pages are
    evicted naturally by scanning the corresponding LRU lists, and only then
    reclaim the inode structure.

    Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reported-by: Spock
    Tested-by: Spock
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Randy Dunlap
    Cc: [4.19.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>'

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

21 Oct, 2018

1 commit


22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull vfs icache updates from Al Viro:

    - NFS mkdir/open_by_handle race fix

    - analogous solution for FUSE, replacing the one currently in mainline

    - new primitive to be used when discarding halfway set up inodes on
    failed object creation; gives sane warranties re icache lookups not
    returning such doomed but still not freed inodes. A bunch of
    filesystems switched to that animal.

    - Miklos' fix for last cycle regression in iget5_locked(); -stable will
    need a slightly different variant, unfortunately.

    - misc bits and pieces around things icache-related (in adfs and jfs).

    * 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    jfs: don't bother with make_bad_inode() in ialloc()
    adfs: don't put inodes into icache
    new helper: inode_fake_hash()
    vfs: don't evict uninitialized inode
    jfs: switch to discard_new_inode()
    ext2: make sure that partially set up inodes won't be returned by ext2_iget()
    udf: switch to discard_new_inode()
    ufs: switch to discard_new_inode()
    btrfs: switch to discard_new_inode()
    new primitive: discard_new_inode()
    kill d_instantiate_no_diralias()
    nfs_instantiate(): prevent multiple aliases for directory inode

    Linus Torvalds
     

04 Aug, 2018

2 commits

  • iput() ends up calling ->evict() on new inode, which is not yet initialized
    by owning fs. So use destroy_inode() instead.

    Add to sb->s_inodes list only if inode is not in I_CREATING state (meaning
    that it wasn't allocated with new_inode(), which already does the
    insertion).

    Reported-by: Al Viro
    Signed-off-by: Miklos Szeredi
    Fixes: 80ea09a002bf ("vfs: factor out inode_insert5()")
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • We don't want open-by-handle picking half-set-up in-core
    struct inode from e.g. mkdir() having failed halfway through.
    In other words, we don't want such inodes returned by iget_locked()
    on their way to extinction. However, we can't just have them
    unhashed - otherwise open-by-handle immediately *after* that would've
    ended up creating a new in-core inode over the on-disk one that
    is in process of being freed right under us.

    Solution: new flag (I_CREATING) set by insert_inode_locked() and
    removed by unlock_new_inode() and a new primitive (discard_new_inode())
    to be used by such halfway-through-setup failure exits instead of
    unlock_new_inode() / iput() combinations. That primitive unlocks new
    inode, but leaves I_CREATING in place.

    iget_locked() treats finding an I_CREATING inode as failure
    (-ESTALE, once we sort out the error propagation).
    insert_inode_locked() treats the same as instant -EBUSY.
    ilookup() treats those as icache miss.
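
    The difference between the two exit paths can be modelled with a pair of
    state bits. This is a userspace sketch only: the MODEL_* flags and
    model_* functions are hypothetical, and only the ilookup() case of the
    three lookup rules above is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state bits standing in for I_NEW and I_CREATING. */
#define MODEL_I_NEW       (1u << 0)
#define MODEL_I_CREATING  (1u << 1)

struct model_inode {
    unsigned long ino;
    unsigned int state;
};

/* Successful setup: both bits go away. */
static void model_unlock_new_inode(struct model_inode *inode)
{
    inode->state &= ~(MODEL_I_NEW | MODEL_I_CREATING);
}

/* Failed setup: the inode is unlocked but stays marked as creating. */
static void model_discard_new_inode(struct model_inode *inode)
{
    inode->state &= ~MODEL_I_NEW;
}

/* Lookup model: an inode still marked as creating counts as a miss. */
static struct model_inode *model_ilookup(struct model_inode *cached)
{
    if (cached && (cached->state & MODEL_I_CREATING))
        return NULL;
    return cached;
}
```

    In this model a discarded inode stays in the cache, so nothing new can
    be set up over it, yet lookups never hand it out.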

    [Fix by Dan Carpenter folded in]

    Signed-off-by: Al Viro

    Al Viro
     

18 Jul, 2018

2 commits


06 Jul, 2018

1 commit

  • sgid directories have special semantics, making newly created files in
    the directory belong to the group of the directory, and newly created
    subdirectories will also become sgid. This is historically used for
    group-shared directories.

    But group directories writable by non-group members should not imply
    that such non-group members can magically join the group, so make sure
    to clear the sgid bit on non-directories for non-members (but remember
    that sgid without group execute means "mandatory locking", just to
    confuse things even more).

    Reported-by: Jann Horn
    Cc: Andy Lutomirski
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
    "This is a late set of changes from Deepa Dinamani doing an automated
    treewide conversion of the inode and iattr structures from 'timespec'
    to 'timespec64', to push the conversion from the VFS layer into the
    individual file systems.

    As Deepa writes:

    'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timespec64
    timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
    becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
    This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
    timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

    Thomas Gleixner adds:

    'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

    * tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    pstore: Remove bogus format string definition
    vfs: change inode times to use struct timespec64
    pstore: Convert internal records to timespec64
    udf: Simplify calls to udf_disk_stamp_to_time
    fs: nfs: get rid of memcpys for inode times
    ceph: make inode time prints to be long long
    lustre: Use long long type to print inode time
    fs: add timespec64_truncate()

    Linus Torvalds
     

07 Jun, 2018

1 commit

  • Pull overlayfs fixes from Miklos Szeredi:
    "This contains a fix for the vfs_mkdir() issue discovered by Al, as
    well as other fixes and cleanups"

    * tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: use inode_insert5() to hash a newly created inode
    ovl: Pass argument to ovl_get_inode() in a structure
    vfs: factor out inode_insert5()
    ovl: clean up copy-up error paths
    ovl: return EIO on internal error
    ovl: make ovl_create_real() cope with vfs_mkdir() safely
    ovl: create helper ovl_create_temp()
    ovl: return dentry from ovl_create_real()
    ovl: struct cattr cleanups
    ovl: strip debug argument from ovl_do_ helpers
    ovl: remove WARN_ON() real inode attributes mismatch
    ovl: Kconfig documentation fixes
    ovl: update documentation for unionmount-testsuite

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • struct timespec is not y2038 safe. Transition vfs to use
    y2038 safe struct timespec64 instead.
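
    To make the y2038 limit concrete, here is a minimal user-space sketch,
    not kernel code. The types ts32 and ts64 and all three helpers are
    hypothetical stand-ins for a timestamp with 32-bit seconds and for
    struct timespec64. The narrowing helper asserts on overflow, which is
    a choice made for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a timestamp with 32-bit seconds. */
struct ts32 {
	int32_t tv_sec;
	int32_t tv_nsec;
};

/* Hypothetical model of a timestamp with 64-bit seconds. */
struct ts64 {
	int64_t tv_sec;
	int32_t tv_nsec;
};

/* Widening never loses information. */
static struct ts64 ts32_to_ts64(struct ts32 t)
{
	struct ts64 r = { .tv_sec = t.tv_sec, .tv_nsec = t.tv_nsec };
	return r;
}

/* Report whether a 64-bit timestamp survives narrowing to 32-bit seconds. */
static int ts64_fits_ts32(struct ts64 t)
{
	return t.tv_sec >= INT32_MIN && t.tv_sec <= INT32_MAX;
}

/* Narrow at a boundary; the caller must have checked the range first. */
static struct ts32 ts64_to_ts32(struct ts64 t)
{
	struct ts32 r;

	assert(ts64_fits_ts32(t));
	r.tv_sec = (int32_t)t.tv_sec;
	r.tv_nsec = t.tv_nsec;
	return r;
}
```

    The largest value that fits, 2147483647 seconds after the epoch, is
    03:14:07 UTC on 19 January 2038. One second later no longer fits in
    ts32, which is why every conversion left at a boundary is a place
    where the wider value can still be lost.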

    The change was made with the help of the following Coccinelle
    script. This catches about 80% of the changes.
    All the header file and logic changes are included in the
    first 5 rules. The rest are trivial substitutions.
    I avoid changing any of the function signatures or any other
    filesystem specific data structures to keep the patch simple
    for review.

    The script could be a little shorter by combining different cases,
    but this version was sufficient for my use case.

    virtual patch

    @ depends on patch @
    identifier now;
    @@
    - struct timespec
    + struct timespec64
    current_time ( ... )
    {
    - struct timespec now = current_kernel_time();
    + struct timespec64 now = current_kernel_time64();
    ...
    - return timespec_trunc(
    + return timespec64_trunc(
    ... );
    }

    @ depends on patch @
    identifier xtime;
    @@
    struct \( iattr \| inode \| kstat \) {
    ...
    - struct timespec xtime;
    + struct timespec64 xtime;
    ...
    }

    @ depends on patch @
    identifier t;
    @@
    struct inode_operations {
    ...
    int (*update_time) (...,
    - struct timespec t,
    + struct timespec64 t,
    ...);
    ...
    }

    @ depends on patch @
    identifier t;
    identifier fn_update_time =~ "update_time$";
    @@
    fn_update_time (...,
    - struct timespec *t,
    + struct timespec64 *t,
    ...) { ... }

    @ depends on patch @
    identifier t;
    @@
    lease_get_mtime( ... ,
    - struct timespec *t
    + struct timespec64 *t
    ) { ... }

    @te depends on patch forall@
    identifier ts;
    local idexpression struct inode *inode_node;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn_update_time =~ "update_time$";
    identifier fn;
    expression e, E3;
    local idexpression struct inode *node1;
    local idexpression struct inode *node2;
    local idexpression struct iattr *attr1;
    local idexpression struct iattr *attr2;
    local idexpression struct iattr attr;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    @@
    (
    (
    - struct timespec ts;
    + struct timespec64 ts;
    |
    - struct timespec ts = current_time(inode_node);
    + struct timespec64 ts = current_time(inode_node);
    )

    <+...
    (
    - timespec_equal(&inode_node->i_xtime, &ts)
    + timespec64_equal(&inode_node->i_xtime, &ts)
    |
    - timespec_equal(&ts, &inode_node->i_xtime)
    + timespec64_equal(&ts, &inode_node->i_xtime)
    |
    - timespec_compare(&inode_node->i_xtime, &ts)
    + timespec64_compare(&inode_node->i_xtime, &ts)
    |
    - timespec_compare(&ts, &inode_node->i_xtime)
    + timespec64_compare(&ts, &inode_node->i_xtime)
    |
    ts = current_time(e)
    |
    fn_update_time(..., &ts,...)
    |
    inode_node->i_xtime = ts
    |
    node1->i_xtime = ts
    |
    ts = inode_node->i_xtime
    |
    <+... attr1->ia_xtime ...+> = ts
    |
    ts = attr1->ia_xtime
    |
    ts.tv_sec
    |
    ts.tv_nsec
    |
    btrfs_set_stack_timespec_sec(..., ts.tv_sec)
    |
    btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
    |
    - ts = timespec64_to_timespec(
    + ts =
    ...
    -)
    |
    - ts = ktime_to_timespec(
    + ts = ktime_to_timespec64(
    ...)
    |
    - ts = E3
    + ts = timespec_to_timespec64(E3)
    |
    - ktime_get_real_ts(&ts)
    + ktime_get_real_ts64(&ts)
    |
    fn(...,
    - ts
    + timespec64_to_timespec(ts)
    ,...)
    )
    ...+>
    (

    )
    |
    - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_equal(&node1->i_xtime1, &node2->i_xtime2)
    |
    - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    + timespec64_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    |
    - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
    |
    node1->i_xtime1 =
    - timespec_trunc(attr1->ia_xtime1,
    + timespec64_trunc(attr1->ia_xtime1,
    ...)
    |
    - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
    + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
    ...)
    |
    - ktime_get_real_ts(&attr1->ia_xtime1)
    + ktime_get_real_ts64(&attr1->ia_xtime1)
    |
    - ktime_get_real_ts(&attr.ia_xtime1)
    + ktime_get_real_ts64(&attr.ia_xtime1)
    )

    @ depends on patch @
    struct inode *node;
    struct iattr *attr;
    identifier fn;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    expression e;
    @@
    (
    - fn(node->i_xtime);
    + fn(timespec64_to_timespec(node->i_xtime));
    |
    fn(...,
    - node->i_xtime);
    + timespec64_to_timespec(node->i_xtime));
    |
    - e = fn(attr->ia_xtime);
    + e = fn(timespec64_to_timespec(attr->ia_xtime));
    )

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn;
    @@
    {
    + struct timespec ts;
    <+...
    (
    + ts = timespec64_to_timespec(node->i_xtime);
    fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    )
    ...+>
    }

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    struct kstat *stat;
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier i_xtime =~ "^i_[acm]time$";
    identifier xtime =~ "^[acm]time$";
    identifier fn, ret;
    @@
    {
    + struct timespec ts;
    <+...
    (
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(stat->xtime);
    ret = fn (...,
    - &stat->xtime);
    + &ts);
    )
    ...+>
    }

    @ depends on patch @
    struct inode *node;
    struct inode *node2;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier i_xtime3 =~ "^i_[acm]time$";
    struct iattr *attrp;
    struct iattr *attrp2;
    struct iattr attr ;
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    struct kstat *stat;
    struct kstat stat1;
    struct timespec64 ts;
    identifier xtime =~ "^[acmb]time$";
    expression e;
    @@
    (
    \( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
    |
    node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    stat->xtime = node2->i_xtime1;
    |
    stat1.xtime = node2->i_xtime1;
    |
    \( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
    |
    \( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
    |
    - e = node->i_xtime1;
    + e = timespec64_to_timespec( node->i_xtime1 );
    |
    - e = attrp->ia_xtime1;
    + e = timespec64_to_timespec( attrp->ia_xtime1 );
    |
    node->i_xtime1 = current_time(...);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    - node->i_xtime1 = e;
    + node->i_xtime1 = timespec_to_timespec64(e);
    )

    Signed-off-by: Deepa Dinamani

    Deepa Dinamani
     

31 May, 2018

2 commits

  • Split out common helper for race free insertion of an already allocated
    inode into the cache. Use this from iget5_locked() and
    insert_inode_locked4(). Make iget5_locked() use new_inode()/iput() instead
    of alloc_inode()/destroy_inode() directly.

    Also export to modules for use by filesystems which want to preallocate an
    inode before file/directory creation.
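
    The "insert an already allocated object unless a match is already
    cached" pattern described above can be sketched in user space as
    follows. This is only an illustration under stated assumptions: struct
    obj, struct cache, cache_insert(), key_test() and key_set() are
    hypothetical names, the code is single-threaded, and the set callback
    cannot fail here. Comments mark where a real implementation would hold
    one lock across both the lookup and the insertion.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8

/* Hypothetical cached object, chained per hash bucket. */
struct obj {
	unsigned long key;
	struct obj *next;
};

struct cache {
	struct obj *buckets[NBUCKETS];
};

typedef int (*test_fn)(const struct obj *, const void *);
typedef void (*set_fn)(struct obj *, const void *);

/*
 * Insert the preallocated @obj unless an object matching @data is
 * already cached. Returns the object that ends up in the cache: the
 * existing match, or @obj itself.
 */
static struct obj *cache_insert(struct cache *c, struct obj *obj,
				unsigned long hashval,
				test_fn test, set_fn set, const void *data)
{
	struct obj **head;
	struct obj *old;

	assert(c && obj && test && set);
	head = &c->buckets[hashval % NBUCKETS];

	/* A concurrent version would take the cache lock here. */
	for (old = *head; old; old = old->next) {
		if (test(old, data))
			return old;	/* someone else inserted first */
	}
	set(obj, data);			/* initialise identity, then publish */
	obj->next = *head;
	*head = obj;
	/* ... and release the cache lock here. */
	return obj;
}

/* Example callbacks that identify objects by an unsigned long key. */
static int key_test(const struct obj *o, const void *data)
{
	return o->key == *(const unsigned long *)data;
}

static void key_set(struct obj *o, const void *data)
{
	o->key = *(const unsigned long *)data;
}
```

    A caller that gets back a pointer other than the one it passed in has
    lost the race and is expected to release its own preallocated object.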

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • In inode_init_always(), we clear the inode mapping flags, which clears
    any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not
    also clear wb_err, which means that old mapping errors can leak through
    to new inodes.

    This is crucial for the XFS inode allocation path because we recycle old
    in-core inodes and we do not want error state from an old file to leak
    into the new file. This bug was discovered by running generic/036 and
    generic/047 in a loop and noticing that the EIOs generated by the
    collision of direct and buffered writes in generic/036 would survive the
    remount between 036 and 047, and get reported to the fsyncs (on
    different files!) in generic/047.
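
    The leak can be modelled with a small user-space sketch. Everything
    here is a simplified assumption: struct mapping_model, the flag bit
    values and the helper names are hypothetical, and wb_err is treated as
    a plain counter rather than the kernel's real error-sequence type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this model. */
#define AS_EIO    (1UL << 0)
#define AS_ENOSPC (1UL << 1)

/* Simplified model of the error state kept per mapping. */
struct mapping_model {
	unsigned long flags;	/* retained error bits */
	uint32_t wb_err;	/* simplified writeback error counter */
};

/* Record a writeback error in both places. */
static void record_error(struct mapping_model *m, unsigned long bit)
{
	assert(m != NULL);
	m->flags |= bit;
	m->wb_err++;
}

/* Reinitialisation as described before the fix: flags only. */
static void init_always_buggy(struct mapping_model *m)
{
	m->flags = 0;
}

/* Reinitialisation after the fix: clear wb_err as well. */
static void init_always_fixed(struct mapping_model *m)
{
	m->flags = 0;
	m->wb_err = 0;
}

/* Would a new user of this recycled object see an old error? */
static int has_stale_error(const struct mapping_model *m)
{
	return m->flags != 0 || m->wb_err != 0;
}
```

    In this model, recycling an object with init_always_buggy() after
    record_error() leaves has_stale_error() true even though flags is
    zero, which mirrors an error from an old file being reported against
    a new one.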

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jeff Layton
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

26 May, 2018

1 commit

  • As vfs moves to using struct timespec64 to represent times,
    update the argument to timespec_truncate() to use
    struct timespec64. Also change the name of the function.
    The rest of the implementation logic is the same.

    Move this to fs/inode.c instead of kernel/time/time.c as all the
    users of this api are filesystems.
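
    As a rough user-space illustration of truncating a timestamp to a
    filesystem's granularity: struct ts64 and ts64_trunc() are
    hypothetical names for this sketch, and the handling of an invalid
    granularity (leave the value unchanged) is an assumption made here.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for struct timespec64. */
struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/*
 * Round @t down to a multiple of @gran nanoseconds.
 * @gran is expected to be between 1 and NSEC_PER_SEC inclusive.
 */
static struct ts64 ts64_trunc(struct ts64 t, unsigned int gran)
{
	assert(t.tv_nsec >= 0 && t.tv_nsec < NSEC_PER_SEC);

	if (gran == 1) {
		/* Nanosecond granularity: nothing to do. */
	} else if (gran == NSEC_PER_SEC) {
		/* Whole-second granularity: no division needed. */
		t.tv_nsec = 0;
	} else if (gran > 1 && gran < NSEC_PER_SEC) {
		t.tv_nsec -= t.tv_nsec % gran;
	}
	/* Any other granularity is invalid; t is returned unchanged. */
	return t;
}
```

    For example, with a granularity of 1000 (microseconds) a tv_nsec of
    123456789 becomes 123456000, and tv_sec is never modified.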

    Signed-off-by: Deepa Dinamani

    Deepa Dinamani
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

12 Mar, 2018

1 commit

  • Noticed when looking at why cycling 600k inodes/s through the inode
    cache was taking a total of 8% cpu in memset() during inode
    initialisation. There is no need to zero the inode.i_data structure
    twice.

    This increases single threaded bulkstat throughput from ~200,000
    inodes/s to ~220,000 inodes/s, so we save a substantial amount of
    CPU time per inode init by doing this.
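
    The double zeroing can be pictured with a small user-space model. All
    names below are hypothetical and the structures are far smaller than
    the real ones. The sketch only shows the shape of the problem: the
    embedded structure is cleared once as part of its container and then
    a second time by its own init helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much reduced models of the embedded and outer structures. */
struct mapping_model {
	unsigned long nrpages;
	void *host;
};

struct inode_model {
	unsigned long i_ino;
	struct mapping_model i_data;	/* embedded, like inode.i_data */
};

/* Counts bytes cleared so the redundant work is visible. */
static size_t zeroed_bytes;

static void zero(void *p, size_t n)
{
	assert(p != NULL);
	memset(p, 0, n);
	zeroed_bytes += n;
}

/* Set up only what must be non-zero; nothing in this tiny model. */
static void mapping_init_once_nozero(struct mapping_model *m)
{
	(void)m;
}

/* Standalone init: the mapping has to clear itself. */
static void mapping_init_once(struct mapping_model *m)
{
	zero(m, sizeof(*m));
	mapping_init_once_nozero(m);
}

/* Redundant version: i_data is cleared twice. */
static void inode_init_once_double(struct inode_model *inode)
{
	zero(inode, sizeof(*inode));
	mapping_init_once(&inode->i_data);
}

/* Single-pass version: the outer clear already covers i_data. */
static void inode_init_once_single(struct inode_model *inode)
{
	zero(inode, sizeof(*inode));
	mapping_init_once_nozero(&inode->i_data);
}
```

    In this model the redundant version clears sizeof(struct
    mapping_model) extra bytes per object, and both versions leave the
    object in the same all-zero state.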

    Signed-Off-By: Dave Chinner
    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner