25 Jun, 2020

1 commit


12 Jun, 2020

1 commit


10 Jun, 2020

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "Fixes:

    - Resolve mount option conflicts consistently

    - Sync before remount R/O

    - Fix file handle encoding corner cases

    - Fix metacopy related issues

    - Fix an uninitialized return value

    - Add missing permission checks for underlying layers

    Optimizations:

    - Allow multiple whiteouts to share an inode

    - Optimize small writes by inheriting SB_NOSEC from upper layer

    - Do not call ->syncfs() multiple times for sync(2)

    - Do not cache negative lookups on upper layer

    - Make private internal mounts longterm"

    * tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits)
    ovl: remove unnecessary lock check
    ovl: make oip->index bool
    ovl: only pass ->ki_flags to ovl_iocb_to_rwf()
    ovl: make private mounts longterm
    ovl: get rid of redundant members in struct ovl_fs
    ovl: add accessor for ofs->upper_mnt
    ovl: initialize error in ovl_copy_xattr
    ovl: drop negative dentry in upper layer
    ovl: check permission to open real file
    ovl: call secutiry hook in ovl_real_ioctl()
    ovl: verify permissions in ovl_path_open()
    ovl: switch to mounter creds in readdir
    ovl: pass correct flags for opening real directory
    ovl: fix redirect traversal on metacopy dentries
    ovl: initialize OVL_UPPERDATA in ovl_lookup()
    ovl: use only uppermetacopy state in ovl_lookup()
    ovl: simplify setting of origin for index lookup
    ovl: fix out of bounds access warning in ovl_check_fb_len()
    ovl: return required buffer size for file handles
    ovl: sync dirty data when remounting to ro mode
    ...

    Linus Torvalds
     

03 Jun, 2020

1 commit

  • Patch series "vfs: have syncfs() return error when there are writeback
    errors", v6.

    Currently, syncfs does not return errors when one of the inodes fails to
    be written back. It will return errors based on the legacy AS_EIO and
    AS_ENOSPC flags when syncing out the block device fails, but that's not
    particularly helpful for filesystems that aren't backed by a blockdev.
    It's also possible for a stray sync to lose those errors.

    The basic idea in this set is to track writeback errors at the
    superblock level, so that we can quickly and easily check whether
    something bad happened without having to fsync each file individually.
    syncfs is then changed to reliably report writeback errors after they
    occur, much in the same fashion as fsync does now.

    This patch (of 2):

    Usually we suggest that applications call fsync when they want to ensure
    that all data written to the file has made it to the backing store, but
    that can be inefficient when there are a lot of open files.

    Calling syncfs on the filesystem can be more efficient in some
    situations, but the error reporting doesn't currently work the way most
    people expect. If a single inode on a filesystem reports a writeback
    error, syncfs won't necessarily return an error. syncfs only returns an
    error if __sync_blockdev fails, and on some filesystems that's a no-op.

    It would be better if syncfs reported an error if there were any
    writeback failures. Then applications could call syncfs to see if there
    are any errors on any open files, and could then call fsync on all of
    the other descriptors to figure out which one failed.

    This patch adds a new errseq_t to struct super_block, and has
    mapping_set_error also record writeback errors there.

    To report those errors, we also need to keep an errseq_t in struct file
    to act as a cursor. This patch adds a dedicated field for that purpose,
    which slots nicely into 4 bytes of padding at the end of struct file on
    x86_64.

    An earlier version of this patch used an O_PATH file descriptor to cue
    the kernel that the open file should track the superblock error and not
    the inode's writeback error.

    I think that API is just too weird though. This is simpler and should
    make syncfs error reporting "just work" even if someone is multiplexing
    fsync and syncfs on the same fds.

    Signed-off-by: Jeff Layton
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Andres Freund
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: David Howells
    Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
    Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
    Signed-off-by: Linus Torvalds

    Jeff Layton
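
The cursor idea described above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions: the `demo_*` names are hypothetical, and the real errseq_t packs more state (for example a "seen" flag) than the plain counter used here.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for errseq_t: an error code plus a counter that
 * is bumped every time a new error is recorded. */
struct demo_errseq {
    int err;        /* most recent negative errno, 0 if none */
    uint32_t seq;   /* incremented on each recorded error */
};

/* Hypothetical superblock: one shared writeback error record per fs. */
struct demo_super_block {
    struct demo_errseq wb_err;
};

/* Hypothetical open file: remembers the last sequence number it saw. */
struct demo_file {
    struct demo_super_block *sb;
    uint32_t sb_cursor;
};

/* Record a writeback failure at the superblock level. */
static void demo_set_error(struct demo_super_block *sb, int err)
{
    assert(err < 0);
    sb->wb_err.err = err;
    sb->wb_err.seq++;
}

/* Sample the current sequence at open time, so errors that happened
 * before the open are not reported through this file. */
static void demo_open(struct demo_file *f, struct demo_super_block *sb)
{
    f->sb = sb;
    f->sb_cursor = sb->wb_err.seq;
}

/* syncfs-like check: report a new error once per file, then advance
 * the cursor so the same error is not reported again on this file. */
static int demo_syncfs_check(struct demo_file *f)
{
    if (f->sb_cursor == f->sb->wb_err.seq)
        return 0;
    f->sb_cursor = f->sb->wb_err.seq;
    return f->sb->wb_err.err;
}
```

Because each open file carries its own cursor, two files on the same filesystem each see the error exactly once, independently of one another.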
     

13 May, 2020

1 commit

  • Stacked filesystems like overlayfs have no writeback of their own, but they
    have to forward syncfs() requests to the backend to preserve data integrity.

    During a global sync(), each overlayfs instance calls ->sync_fs() on its
    backend, although the backend is itself in the global list of superblocks
    too. As a result, a single sync() syscall could write one superblock
    several times and send multiple disk barriers.

    This patch adds the flag SB_I_SKIP_SYNC to sb->s_iflags to avoid that.

    Reported-by: Dmitry Monakhov
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Konstantin Khlebnikov
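
The effect of the flag can be illustrated with a toy model of the global superblock walk. Everything here is an assumption made for illustration: the `demo_*` names are hypothetical and the flag value is not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SB_I_SKIP_SYNC 0x1u   /* illustrative value only */

struct demo_sb {
    unsigned int iflags;
    struct demo_sb *backend;   /* backing fs for a stacked fs, else NULL */
    int sync_count;            /* how often this fs was actually synced */
};

/* ->sync_fs() stand-in: a stacked fs forwards the request to its
 * backend, a regular fs counts one real sync. */
static void demo_sync_fs(struct demo_sb *sb)
{
    if (sb->backend)
        demo_sync_fs(sb->backend);
    else
        sb->sync_count++;
}

/* Global sync(): walk all superblocks; when the flag is honored,
 * flagged (stacked) superblocks are skipped because their backend is
 * reached through the same walk anyway. */
static void demo_sync_all(struct demo_sb **sbs, size_t n, int honor_skip_flag)
{
    assert(sbs != NULL);
    for (size_t i = 0; i < n; i++) {
        if (honor_skip_flag && (sbs[i]->iflags & DEMO_SB_I_SKIP_SYNC))
            continue;
        demo_sync_fs(sbs[i]);
    }
}
```

With one disk filesystem and two overlay instances stacked on it, the unflagged walk syncs the disk three times and the flagged walk syncs it once. A direct `demo_sync_fs()` call on an overlay superblock, the analogue of syncfs(), still reaches the backend.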
     

21 May, 2019

1 commit


15 May, 2019

1 commit

  • 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
    writeback") claims that sync_file_range(2) syscall was "created for
    userspace to be able to issue background writeout and so waiting for
    in-flight IO is undesirable there" and changes the writeback (back) to
    WB_SYNC_NONE.

    This claim is only partially true. It is true for users that use the flag
    SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was the
    reason for changing to WB_SYNC_NONE writeback.

    However, that claim is not true for users that use the flag combination
    SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|WAIT_AFTER}. Those users explicitly
    request waiting for in-flight IO as well as writeback of dirty pages.

    Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
    WB_SYNC_ALL writeback to perform the full range sync request.

    Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
    Link: http://lkml.kernel.org/r/20190419072938.31320-1-amir73il@gmail.com
    Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
    Signed-off-by: Amir Goldstein
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amir Goldstein
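
The flag-to-writeback-mode rule described above can be sketched as a small decision function. This is a model rather than the kernel code: the `DEMO_*` names are local copies introduced for illustration, with the numeric flag values taken from the Linux UAPI definitions of SYNC_FILE_RANGE_*.

```c
#include <assert.h>

/* Local copies of the sync_file_range(2) flag bits. */
#define DEMO_SFR_WAIT_BEFORE 1u
#define DEMO_SFR_WRITE       2u
#define DEMO_SFR_WAIT_AFTER  4u
#define DEMO_SFR_WRITE_AND_WAIT \
    (DEMO_SFR_WRITE | DEMO_SFR_WAIT_BEFORE | DEMO_SFR_WAIT_AFTER)

enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

/* Pick the writeback mode for a request that includes the WRITE flag:
 * only the full three-flag combination asks for integrity writeback,
 * everything else stays a background (non-waiting) writeout. */
static enum demo_sync_mode demo_pick_sync_mode(unsigned int flags)
{
    assert(flags & DEMO_SFR_WRITE);
    if ((flags & DEMO_SFR_WRITE_AND_WAIT) == DEMO_SFR_WRITE_AND_WAIT)
        return DEMO_WB_SYNC_ALL;
    return DEMO_WB_SYNC_NONE;
}
```

A caller passing only the WRITE flag, as PostgreSQL is said to do, keeps the cheap WB_SYNC_NONE behavior. Partial combinations such as WRITE plus WAIT_AFTER also stay on WB_SYNC_NONE in this model.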
     

04 May, 2019

1 commit

  • This adds a counter to the taskstats extended accounting fields, which
    tracks the number of times fsync is called, and then plumbs it through
    to the uid_sys_stats driver.

    Bug: 120442023
    Change-Id: I6c138de5b2332eea70f57e098134d1d141247b3f
    Signed-off-by: Jin Qian
    [AmitP: Refactored changes to align with changes from upstream commit
    9a07000400c8 ("sched/headers: Move CONFIG_TASK_XACCT bits from <linux/sched.h> to <linux/sched/xacct.h>")]
    Signed-off-by: Amit Pundir
    [tkjos: Needed for storaged fsync accounting ("storaged --uid" and
    "storaged --task").]
    [astrachan: This is modifying a userspace interface and should probably
    be reworked]
    Signed-off-by: Alistair Strachan

    Jin Qian
     

03 May, 2019

1 commit


05 Apr, 2018

1 commit

  • Pull xfs updates from Darrick Wong:
    "Here's the first round of fixes for XFS for 4.17.

    The biggest new features this time around are the addition of lazytime
    support, further enhancement of the on-disk inode metadata verifiers,
    and a patch to smooth over some of the AGFL padding problems that have
    intermittently plagued users since 4.5. I foresee sending a second pull
    request next week with further bug fixes and speedups in the online
    scrub code and elsewhere.

    This series has been run through a full xfstests run over the weekend
    and through a quick xfstests run against this morning's master, with
    no major failures reported.

    Summary of changes for this release:

    - Various cleanups and code fixes

    - Implement lazytime as a mount option

    - Convert various on-disk metadata checks from asserts to -EFSCORRUPTED

    - Fix accounting problems with the rmap per-ag reservations

    - Refactorings and cleanups for xfs_log_force

    - Various bugfixes for the reflink code

    - Work around v5 AGFL padding problems to prevent fs shutdowns

    - Establish inode fork verifiers to inspect on-disk metadata
    correctness

    - Various online scrub fixes

    - Fix v5 swapext blowing up on deleted inodes"

    * tag 'xfs-4.17-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (49 commits)
    xfs: do not log/recover swapext extent owner changes for deleted inodes
    xfs: clean up xfs_mount allocation and dynamic initializers
    xfs: remove dead inode version setting code
    xfs: catch inode allocation state mismatch corruption
    xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt
    xfs: flag inode corruption if parent ptr doesn't get us a real inode
    xfs: don't accept inode buffers with suspicious unlinked chains
    xfs: move inode extent size hint validation to libxfs
    xfs: record inode buf errors as a xref error in inobt scrubber
    xfs: remove xfs_buf parameter from inode scrub methods
    xfs: inode scrubber shouldn't bother with raw checks
    xfs: bmap scrubber should do rmap xref with bmap for sparse files
    xfs: refactor inode buffer verifier error logging
    xfs: refactor inode verifier error logging
    xfs: refactor bmap record validation
    xfs: sanity-check the unused space before trying to use it
    xfs: detect agfl count corruption and reset agfl
    xfs: unwind the try_again loop in xfs_log_force
    xfs: refactor xfs_log_force_lsn
    xfs: minor cleanup for xfs_reflink_end_cow
    ...

    Linus Torvalds
     

03 Apr, 2018

2 commits

  • Using this helper allows us to avoid the in-kernel calls to the
    sys_sync_file_range() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it uses
    the same calling convention as sys_sync_file_range().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using this helper allows us to avoid the in-kernel calls to the
    sys_sync() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_sync().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Alexander Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

12 Mar, 2018

1 commit


15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Oct, 2017

1 commit


15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

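    sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
    -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
    -e 's/\<MS_NODEV\>/SB_NODEV/g' \
    -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
    -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
    -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
    -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
    -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
    -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
    -e 's/\<MS_SILENT\>/SB_SILENT/g' \
    -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
    -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
    -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
    -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
    $list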

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

07 Sep, 2017

2 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton: (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • fsync codepath assumes that f_mapping can never be NULL, but
    sync_file_range has a check for that.

    Remove the one from sync_file_range as I don't see how you'd ever get a
    NULL pointer in here.

    Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

01 Aug, 2017

1 commit

  • sync_file_range doesn't call down into the filesystem directly at all.
    It only kicks off writeback of pagecache pages and optionally waits
    on the result.

    Convert sync_file_range to use errseq_t based error tracking, under the
    assumption that most users will prefer this behavior when errors occur.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     

17 Jul, 2017

1 commit

  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
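
The reason for the final `(bool)` cast can be shown with a small standalone example. It deliberately uses a hypothetical flag on a bit other than bit 0 (MS_RDONLY itself is bit 0, where the numeric values happen to coincide), so that the general hazard of comparing a masked integer with a bool becomes visible. All `demo_*` names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_FLAG 0x10u   /* hypothetical flag bit, deliberately not bit 0 */

struct demo_sb {
    unsigned long s_flags;
};

/* Helper in the style of sb_rdonly(): returns a bool, i.e. 0 or 1. */
static bool demo_flag_set(const struct demo_sb *sb)
{
    return (sb->s_flags & DEMO_FLAG) != 0;
}

/* Naive comparison: the left side is 0 or 0x10, the right side is
 * 0 or 1, so "both set" is wrongly reported as a difference. */
static bool demo_differs_naive(unsigned long a, const struct demo_sb *sb)
{
    return (a & DEMO_FLAG) != demo_flag_set(sb);
}

/* Normalizing the masked value to bool first makes both sides 0 or 1. */
static bool demo_differs_cast(unsigned long a, const struct demo_sb *sb)
{
    return (bool)(a & DEMO_FLAG) != demo_flag_set(sb);
}
```

With the flag set on both sides, the naive form compares 0x10 against 1 and reports a mismatch, while the cast form compares 1 against 1.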
     

06 Jul, 2017

1 commit


20 Feb, 2017

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether a
    PAGE_CACHE_* or a PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Nov, 2015

1 commit

  • sync_file_range(2) is documented to issue writeback only for pages that
    are not currently being written. After all, the system call has been
    created for userspace to be able to issue background writeout and so
    waiting for in-flight IO is undesirable there. However, commit
    ee53a891f474 ("mm: do_sync_mapping_range integrity fix") switched
    do_sync_mapping_range() and thus sync_file_range() to issue writeback in
    WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code
    relying on WB_SYNC_ALL semantics.

    These days do_sync_mapping_range() went away and we can switch
    sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
    help PostgreSQL avoid large latency spikes when flushing data in the
    background.

    Andres measured a 20% increase in transactions per second on an SSD disk.

    Signed-off-by: Jan Kara
    Reported-by: Andres Freund
    Tested-By: Andres Freund
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

06 Nov, 2015

1 commit

  • filemap_fdatawait() is a function that waits for on-going writeback to
    complete, but it also consumes and clears the error status that was set
    on the mapping during writeback.

    The latter functionality is critical for applications to detect writeback
    error with system calls like fsync(2)/fdatasync(2).

    However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
    which don't check error status of individual mappings.

    As a result, fsync() may not be able to detect writeback error if events
    happen in the following order:

    Application                    System admin
    ----------------------------------------------------------
    write data on page cache
                                   Run sync command
                                   writeback completes with error
                                   filemap_fdatawait() clears error
    fsync returns success
    (but the data is not on disk)

    This patch adds filemap_fdatawait_keep_errors() for call sites where
    writeback error is not handled so that they don't clear error status.

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Andi Kleen
    Reviewed-by: Tejun Heo
    Cc: Fengguang Wu
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junichi Nomura
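
The timeline above can be replayed with a toy model of a mapping's error bit. This is a sketch under simplifying assumptions: the `demo_*` functions are hypothetical stand-ins that only track the error status and do not wait for any real writeback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a mapping whose writeback may have hit an I/O error. */
struct demo_mapping {
    bool eio;
};

/* Like filemap_fdatawait(): report the error status and clear it. */
static int demo_fdatawait(struct demo_mapping *m)
{
    assert(m != NULL);
    int ret = m->eio ? -EIO : 0;
    m->eio = false;
    return ret;
}

/* Like filemap_fdatawait_keep_errors(): wait only (elided here) and
 * leave the error status in place for a later fsync to find. */
static void demo_fdatawait_keep_errors(struct demo_mapping *m)
{
    (void)m;
}

/* fsync as seen by the application: consumes the error status. */
static int demo_fsync(struct demo_mapping *m)
{
    return demo_fdatawait(m);
}

/* Replay the timeline: writeback fails, the admin runs sync, then the
 * application calls fsync. Returns what fsync reports. */
static int demo_timeline(bool sync_keeps_errors)
{
    struct demo_mapping m = { .eio = false };

    m.eio = true;                        /* writeback completes with error */
    if (sync_keeps_errors)
        demo_fdatawait_keep_errors(&m);  /* sync path after the patch */
    else
        (void)demo_fdatawait(&m);        /* sync path before the patch */
    return demo_fsync(&m);
}
```

In this model, `demo_timeline(false)` makes fsync report success even though writeback failed, while `demo_timeline(true)` lets fsync report the error.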
     

05 Feb, 2015

1 commit

  • Add a new mount option which enables a new "lazytime" mode. This mode
    causes atime, mtime, and ctime updates to only be made to the
    in-memory version of the inode. The on-disk times will only get
    updated when (a) the inode needs to be updated for some non-time
    related change, (b) userspace calls fsync(), syncfs() or sync(), or
    (c) just before an undeleted inode is evicted from memory.

    This is OK according to POSIX because there are no guarantees after a
    crash unless userspace explicitly requests them via an fsync(2) call.

    For workloads which feature a large number of random writes to a
    preallocated file, the lazytime mount option significantly reduces
    writes to the inode table. The repeated 4k writes to a single block
    will result in undesirable stress on flash devices and SMR disk
    drives. Even on conventional HDD's, the repeated writes to the inode
    table block will trigger Adjacent Track Interference (ATI) remediation
    latencies, which very negatively impact long tail latencies --- which
    is a very big deal for web serving tiers (for example).

    Google-Bug-Id: 18297052

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
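
The three flush conditions listed above can be modelled with a toy inode that keeps separate in-memory and on-disk timestamps. This is only a sketch: the `demo_*` names and the single `mtime` field are assumptions for illustration and do not mirror the kernel's inode state machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_inode {
    long mem_mtime;    /* timestamp in the in-memory inode */
    long disk_mtime;   /* timestamp last written to disk */
    bool time_dirty;   /* in-memory time is newer than on-disk time */
};

/* Timestamp update under lazytime: touches memory only. */
static void demo_touch(struct demo_inode *ino, long now)
{
    assert(ino != NULL);
    ino->mem_mtime = now;
    ino->time_dirty = true;
}

/* Write the inode out; pending timestamps go along with it. */
static void demo_write_inode(struct demo_inode *ino)
{
    ino->disk_mtime = ino->mem_mtime;
    ino->time_dirty = false;
}

/* (a) a non-time change forces an inode write anyway. */
static void demo_nontime_change(struct demo_inode *ino)
{
    demo_write_inode(ino);
}

/* (b) fsync()/syncfs()/sync() flush pending timestamps. */
static void demo_fsync(struct demo_inode *ino)
{
    if (ino->time_dirty)
        demo_write_inode(ino);
}

/* (c) eviction of an undeleted inode flushes pending timestamps. */
static void demo_evict(struct demo_inode *ino)
{
    if (ino->time_dirty)
        demo_write_inode(ino);
}
```

Repeated `demo_touch()` calls between flushes cost no inode writes in this model, which is the saving the mount option is aiming for.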
     

20 Nov, 2014

1 commit


05 Sep, 2014

1 commit

  • This patch changes sync_filesystem() to be EXPORT_SYMBOL().

    The reason this is needed is that starting with 3.15 kernel, due to
    Theodore Ts'o's commit 02b9984d6408 ("fs: push sync_filesystem() down to
    the file system's remount_fs()"), all file systems that have dirty data
    to be written out need to call sync_filesystem() from their
    ->remount_fs() method when remounting read-only.

    As this is now a generically required function rather than an internal
    only function it should be EXPORT_SYMBOL() so that all file systems can
    call it.

    Signed-off-by: Anton Altaparmakov
    Acked-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
     

22 Feb, 2014

1 commit

  • This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
    Chinner has reported the commit may cause some
    inodes to be left out from sync(2). This is because we can call
    redirty_tail() for some inode (which sets i_dirtied_when to current time)
    after sync(2) has started or similarly requeue_inode() can set
    i_dirtied_when to current time if writeback had to skip some pages. The
    real problem is in the functions clobbering i_dirtied_when but fixing
    that isn't trivial so revert is a safer choice for now.

    CC: stable@vger.kernel.org # >= 3.13
    Signed-off-by: Jan Kara

    Jan Kara
     

10 Feb, 2014

1 commit

  • It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
    when sync_page_range() had been introduced; generic_file_write{,v}() correctly
    synced
    pos_after_write - written .. pos_after_write - 1
    but generic_file_aio_write() synced
    pos_before_write .. pos_before_write + written - 1
    instead. Which is not the same thing with O_APPEND, obviously.
    A couple of years later correct variant had been killed off when
    everything switched to use of generic_file_aio_write().

    All users of generic_file_aio_write() are affected, and the same bug
    has been copied into other instances of ->aio_write().

    The fix is trivial; the only subtle point is that generic_write_sync()
    ought to be inlined to avoid calculations useless for the majority of
    calls.

    Signed-off-by: Al Viro

    Al Viro
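
The difference between the two ranges can be worked through with concrete numbers. The helpers below are hypothetical and only do the arithmetic from the description above. The scenario assumes an O_APPEND write, where the data lands at end of file regardless of the position the caller held before the write.

```c
#include <assert.h>

/* Inclusive byte range to be synced. */
struct demo_range {
    long long start;
    long long end;
};

/* Range derived from the position before the write, as described for
 * generic_file_aio_write(): wrong when the write did not start there. */
static struct demo_range demo_range_from_pos_before(long long pos_before,
                                                    long long written)
{
    assert(written > 0);
    struct demo_range r = { pos_before, pos_before + written - 1 };
    return r;
}

/* Range derived from the position after the write: always covers the
 * bytes that were actually written. */
static struct demo_range demo_range_from_pos_after(long long pos_after,
                                                   long long written)
{
    assert(written > 0);
    struct demo_range r = { pos_after - written, pos_after - 1 };
    return r;
}
```

For a 1000-byte file opened with O_APPEND and a stale position of 0, a 100-byte write lands at bytes 1000..1099 and leaves the position at 1100. The first helper yields 0..99, which misses the new data entirely, and the second yields 1000..1099.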
     

13 Nov, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton: (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.cuse dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • When processes are heavily creating small files while sync(2) is
    running, many new files can easily be created between the
    WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2). That can happen
    especially if there are several busy filesystems (remember that sync
    traverses filesystems sequentially and waits in the WB_SYNC_ALL phase on
    one fs before starting it on another fs). Because the WB_SYNC_ALL pass is
    slow (e.g. it causes a transaction commit and cache flush for each inode
    in ext3), the resulting sync(2) times are rather large.

    The following script reproduces the problem:

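    # Start 10 background writers in directory $1, each creating
    # 40000 files of 16 KiB.
    function run_writers
    {
        for (( i = 0; i < 10; i++ )); do
            mkdir "$1/dir$i"
            for (( j = 0; j < 40000; j++ )); do
                dd if=/dev/zero of="$1/dir$i/$j" bs=4k count=4 &>/dev/null
            done &
        done
    }

    # One set of writers per filesystem given on the command line.
    for dir in "$@"; do
        run_writers "$dir"
    done

    sleep 40
    time sync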

    Fix the problem by disregarding inodes dirtied after sync(2) was called
    in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
    a time stamp when sync has started which is used for setting up work for
    flusher threads.

    To give some numbers, when the above script is run on two ext4
    filesystems on a simple SATA drive, the average sync time from 10 runs
    is 267.549 seconds with standard deviation 104.799426. With the patched
    kernel, the average sync time from 10 runs is 2.995 seconds with
    standard deviation 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
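
The idea of disregarding inodes dirtied after sync(2) started can be sketched as a simple filter. This is a toy model only: the `Inode` class, the integer tick values, and the `inodes_for_sync_all()` helper are hypothetical, and the exact boundary comparison used by the kernel is not taken from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    name: str
    dirtied_when: int  # hypothetical tick at which the inode became dirty


def inodes_for_sync_all(dirty_inodes, sync_started):
    """Pick the inodes a WB_SYNC_ALL pass has to wait on.

    Inodes dirtied after sync_started are skipped, so writers that keep
    creating files cannot extend the sync indefinitely.
    """
    return [ino for ino in dirty_inodes if ino.dirtied_when <= sync_started]


# Hypothetical state: sync started at tick 10, one file was created later.
dirty = [Inode("a", 5), Inode("b", 9), Inode("c", 12)]
selected = [ino.name for ino in inodes_for_sync_all(dirty, sync_started=10)]
```

Here `selected` contains only `a` and `b`. Inode `c` stays dirty and is left for regular background writeback or a later sync.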

    Jan Kara
     

25 Oct, 2013

1 commit


04 Mar, 2013

1 commit


23 Feb, 2013

1 commit


27 Sep, 2012

1 commit


23 Jul, 2012

3 commits

  • wakeup_flusher_threads(0) will queue work that does complete writeback for
    each flusher thread. Thus there is not much point in submitting another work
    item doing full inode WB_SYNC_NONE writeback via writeback_inodes_sb().

    After this change it does not make sense to call nonblocking ->sync_fs and
    block device flush before calling sync_inodes_sb() because
    wakeup_flusher_threads() is completely asynchronous and thus these functions
    would be called in parallel with inode writeback running which will effectively
    void any work they do. So we move sync_inodes_sb() call before these two
    functions.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • It is not necessary to write block devices twice. The reason why we first did
    a flush and then a proper sync is that
        for_each_bdev() {
            write_bdev()
            wait_for_completion()
        }
    is much slower than
        for_each_bdev()
            write_bdev()
        for_each_bdev()
            wait_for_completion()
    when there is a larger amount of data. But as seen above, there's no real
    need to scan pages and submit them twice. We just need to separate the
    submission and waiting parts. This patch does that.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro
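
Why the second loop shape is faster can be seen with a rough cost model. This sketch assumes that each device takes a fixed time to finish its I/O, that devices work independently once I/O is submitted, and that submission itself is free. None of these assumptions come from the commit message, and both function names are made up for illustration.

```python
def write_and_wait_per_device(io_times):
    """Model of: for_each_bdev() { write_bdev(); wait_for_completion(); }

    Each device is waited on before the next one is even started, so the
    individual I/O times add up.
    """
    return sum(io_times)


def submit_all_then_wait(io_times):
    """Model of: submit to every device first, then wait for every device.

    All devices run concurrently (an assumption of this model), so the
    total is bounded by the slowest one.
    """
    return max(io_times, default=0)


# Hypothetical per-device completion times in seconds.
io_times = [3.0, 5.0, 2.0]
serial_total = write_and_wait_per_device(io_times)
batched_total = submit_all_then_wait(io_times)
```

With these numbers the per-device loop takes 10.0 seconds and the split loops take 5.0 seconds. On real hardware the gain depends on how much the devices share a bottleneck.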

    Jan Kara
     
  • If a block device does not have a filesystem mounted on it, sys_sync will just
    ignore it and won't write out its dirty pages. This is because the writeback
    code avoids writing inodes from a superblock without a backing device, and
    blockdev_superblock is such a superblock. Since it's unexpected that sync
    doesn't write out dirty data for block devices, be nice to users and change
    the behavior to do so. So now we iterate over all block devices on
    blockdev_super instead of iterating over all superblocks when syncing block
    devices.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara