06 Dec, 2018

1 commit

  • commit ecebf55d27a11538ea84aee0be643dd953f830d5 upstream.

    The function ext2_xattr_set calls brelse(bh) to drop the reference count
    of bh, after which bh may be freed. However, following brelse(bh), it
    reads bh->b_data via the macro HDR(bh), which may result in a
    use-after-free bug. This patch moves brelse(bh) to after the field is read.

    CC: stable@vger.kernel.org
    Signed-off-by: Pan Bian
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Pan Bian
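
    The ordering problem described in this commit can be sketched in user
    space. This is a minimal model, not the ext2 code: struct demo_buf and
    the demo_* helpers are hypothetical stand-ins for a reference-counted
    buffer and for brelse(). The point is only that the read happens while
    the reference is still held.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reference-counted buffer; not struct buffer_head. */
struct demo_buf {
    int refcount;
    char data[16];
};

/* Allocate a buffer holding one reference and a short payload. */
static struct demo_buf *demo_get(const char *payload)
{
    struct demo_buf *b = calloc(1, sizeof(*b));
    if (!b)
        return NULL;
    b->refcount = 1;
    strncpy(b->data, payload, sizeof(b->data) - 1);
    return b;
}

/* Drop one reference; the buffer is freed when the count reaches zero. */
static void demo_release(struct demo_buf *b)
{
    if (b && --b->refcount == 0)
        free(b);
}

/* Safe ordering: copy what is needed out of the buffer, then release it. */
static char demo_first_byte(struct demo_buf *b)
{
    char c = b->data[0];   /* read while the reference is still held */
    demo_release(b);       /* b must not be touched after this line */
    return c;
}
```

    Swapping the two statements in demo_first_byte() would reproduce the
    class of bug the patch fixes: the read would touch memory that may
    already have been freed.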
     

11 Jul, 2018

2 commits

  • commit 80660f20252d6f76c9f203874ad7c7a4a8508cf8 upstream.

    The function's return values do not match its name. The name suggests a
    true or false result, but it actually returns 0/-errno, which makes the
    calling code very confusing. Change the function to return a bool: true
    if DAX is supported and false if it is not.

    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Dave Jiang
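
    The confusion this commit describes can be shown with two small
    functions. This is a sketch under stated assumptions: struct demo_dev
    and both demo_* functions are hypothetical and only mirror the two
    return conventions, not the real bdev_dax_supported() signature.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device descriptor used only for this illustration. */
struct demo_dev {
    bool has_dax;
};

/* Old style: 0 means "supported", a negative errno means "not supported". */
static int demo_dax_supported_errno(const struct demo_dev *dev)
{
    return dev->has_dax ? 0 : -EOPNOTSUPP;
}

/* New style: the return value answers the question the name asks. */
static bool demo_dax_supported(const struct demo_dev *dev)
{
    return dev->has_dax;
}
```

    With the old convention, "if (demo_dax_supported_errno(dev))" is taken
    when DAX is not supported, which is the opposite of what the name
    suggests. With the bool version the condition reads naturally.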
     
  • commit ba23cba9b3bdc967aabdc6ff1e3e9b11ce05bb4f upstream.

    Change bdev_dax_supported so it takes a bdev parameter. This enables
    multi-device filesystems like xfs to check that a dax device can work for
    the particular filesystem. Once that's in place, actually fix all the
    parts of XFS where we need to be able to distinguish between datadev and
    rtdev.

    This patch fixes the problem where we screw up the dax support checking
    in xfs if the datadev and rtdev have different dax capabilities.

    Signed-off-by: Darrick J. Wong
    [rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
    Signed-off-by: Ross Zwisler
    Reviewed-by: Eric Sandeen
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

30 May, 2018

1 commit

  • commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.

    For anything NFS-exported we do _not_ want to unlock new inode
    before it has grown an alias; original set of fixes got the
    ordering right, but missed the nasty complication in case of
    lockdep being enabled - unlock_new_inode() does
    lockdep_annotate_inode_mutex_key(inode)
    which can only be done before anyone gets a chance to touch
    ->i_mutex. Unfortunately, flipping the order and doing
    unlock_new_inode() before d_instantiate() opens a window when
    mkdir can race with open-by-fhandle on a guessed fhandle, leading
    to multiple aliases for a directory inode and all the breakage
    that follows from that.

    Correct solution: a new primitive (d_instantiate_new())
    combining these two in the right order - lockdep annotate, then
    d_instantiate(), then the rest of unlock_new_inode(). All
    combinations of d_instantiate() with unlock_new_inode() should
    be converted to that.

    Cc: stable@kernel.org # 2.6.29 and later
    Tested-by: Mike Marshall
    Reviewed-by: Andreas Dilger
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

25 May, 2018

1 commit

  • commit 5aa1437d2d9a068c0334bd7c9dafa8ec4f97f13b upstream.

    open file, unlink it, then use ioctl(2) to make it immutable or
    append only. Now close it and watch the blocks *not* freed...

    Immutable/append-only checks belong in ->setattr().
    Note: the bug is old and backport to anything prior to 737f2e93b972
    ("ext2: convert to use the new truncate convention") will need
    these checks lifted into ext2_setattr().

    Cc: stable@kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

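    sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
    -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
    -e 's/\<MS_NODEV\>/SB_NODEV/g' \
    -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
    -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
    -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
    -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
    -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
    -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
    -e 's/\<MS_SILENT\>/SB_SILENT/g' \
    -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
    -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
    -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
    -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \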
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

07 Sep, 2017

1 commit

  • When servicing mmap() reads from file holes the current DAX code
    allocates a page cache page of all zeroes and places the struct page
    pointer in the mapping->page_tree radix tree.

    This has three major drawbacks:

    1) It consumes memory unnecessarily. For every 4k page that is read via
    a DAX mmap() over a hole, we allocate a new page cache page. This
    means that if you read 1GiB worth of pages, you end up using 1GiB of
    zeroed memory. This is easily visible by looking at the overall
    memory consumption of the system or by looking at /proc/[pid]/smaps:

    7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
    Size: 1048576 kB
    Rss: 1048576 kB
    Pss: 1048576 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 1048576 kB
    Private_Dirty: 0 kB
    Referenced: 1048576 kB
    Anonymous: 0 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Locked: 0 kB

    2) It is slower than using a common zero page because each page fault
    has more work to do. Instead of just inserting a common zero page we
    have to allocate a page cache page, zero it, and then insert it. Here
    are the average latencies of dax_load_hole() as measured by ftrace on
    a random test box:

    Old method, using zeroed page cache pages: 3.4 us
    New method, using the common 4k zero page: 0.8 us

    This was the average latency over 1 GiB of sequential reads done by
    this simple fio script:

    [global]
    size=1G
    filename=/root/dax/data
    fallocate=none
    [io]
    rw=read
    ioengine=mmap

    3) The fact that we had to check for both DAX exceptional entries and
    for page cache pages in the radix tree made the DAX code more
    complex.

    Solve these issues by following the lead of the DAX PMD code and using a
    common 4k zero page instead. As with the PMD code we will now insert a
    DAX exceptional entry into the radix tree instead of a struct page
    pointer which allows us to remove all the special casing in the DAX
    code.

    Note that we do still pretty aggressively check for regular pages in the
    DAX radix tree, especially where we take action based on the bits set in
    the page. If we ever find a regular page in our radix tree now that
    most likely means that someone besides DAX is inserting pages (which has
    happened lots of times in the past), and we want to find that out early
    and fail loudly.

    This solution also removes the extra memory consumption. Here is that
    same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
    code:

    7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
    Size: 1048576 kB
    Rss: 0 kB
    Pss: 0 kB
    Shared_Clean: 0 kB
    Shared_Dirty: 0 kB
    Private_Clean: 0 kB
    Private_Dirty: 0 kB
    Referenced: 0 kB
    Anonymous: 0 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 0 kB
    SwapPss: 0 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Locked: 0 kB

    Overall system memory consumption is similarly improved.

    Another major change is that we remove dax_pfn_mkwrite() from our fault
    flow, and instead rely on the page fault itself to make the PTE dirty
    and writeable. The following description from the patch adding the
    vm_insert_mixed_mkwrite() call explains this a little more:

    "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

    case 1: 4k zero page => writable DAX storage
    case 2: read-only DAX storage => writeable DAX storage

    This distinction is made via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

    Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
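
    The figures quoted in this commit can be cross-checked with simple
    arithmetic. The sketch below assumes one fault per 4k page over the
    1 GiB read, which the commit message does not state explicitly; the
    demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to cover a range of the given size. */
static uint64_t demo_pages(uint64_t bytes, uint64_t page_size)
{
    return bytes / page_size;
}

/* Total fault-handling time in microseconds, assuming one fault per page. */
static double demo_total_us(uint64_t pages, double per_fault_us)
{
    return (double)pages * per_fault_us;
}
```

    1 GiB / 4 KiB gives 262144 pages, which at 4 kB each matches the
    1048576 kB Rss shown for the old code. Under the one-fault-per-page
    assumption, 3.4 us per fault adds up to roughly 0.89 s for the old
    method and 0.8 us per fault to roughly 0.21 s for the new one.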
     

01 Sep, 2017

1 commit

  • The ->iomap_begin() operation is a hot path, so cache the
    fs_dax_get_by_host() result at mount time to avoid incurring the
    hash lookup overhead on a per-i/o basis.

    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Reviewed-by: Jan Kara
    Reported-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
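
    The caching pattern described here can be modelled as follows. This is
    a hedged sketch, not the ext2 change itself: demo_slow_lookup() stands
    in for the hash lookup, and struct demo_sb_info, demo_mount() and
    demo_iomap_begin() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

static int demo_lookup_calls;   /* counts how often the slow path runs */
static int demo_device = 42;    /* stands in for a dax device object */

/* Hypothetical slow lookup, standing in for a hash-table search. */
static int *demo_slow_lookup(void)
{
    demo_lookup_calls++;
    return &demo_device;
}

/* Per-mount private data that remembers the lookup result. */
struct demo_sb_info {
    int *cached_dev;
};

/* Mount time: pay for the lookup once and cache the result. */
static void demo_mount(struct demo_sb_info *sbi)
{
    sbi->cached_dev = demo_slow_lookup();
}

/* Hot path: reuse the cached pointer instead of looking it up again. */
static int *demo_iomap_begin(const struct demo_sb_info *sbi)
{
    return sbi->cached_dev;
}
```

    However many times demo_iomap_begin() runs, demo_lookup_calls stays
    at 1, which is the per-i/o saving the commit is after.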
     

18 Jul, 2017

1 commit

  • When changing a file's acl mask, ext2_set_acl() will first set the group
    bits of i_mode to the value of the mask, and only then set the actual
    extended attribute representing the new acl.

    If the second part fails (due to lack of space, for example) and the file
    had no acl attribute to begin with, the system will from now on assume
    that the mask permission bits are actual group permission bits, potentially
    granting access to the wrong users.

    Prevent this by only changing the inode mode after the acl has been set.

    [JK: Rebased on top of "ext2: Don't clear SGID when inheriting ACLs"]
    Signed-off-by: Ernesto A. Fernández
    Signed-off-by: Jan Kara

    Ernesto A. Fernández
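
    The fix amounts to reordering two steps. A minimal user-space sketch,
    with hypothetical names (struct demo_inode, demo_store_acl(),
    demo_set_acl()) and a flag that forces the attribute write to fail:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inode model: permission bits plus an "ACL stored" marker. */
struct demo_inode {
    unsigned int mode;
    bool has_acl;
};

/* Hypothetical xattr write that can be made to fail, e.g. to model ENOSPC. */
static int demo_store_acl(struct demo_inode *inode, bool out_of_space)
{
    if (out_of_space)
        return -ENOSPC;
    inode->has_acl = true;
    return 0;
}

/* Fixed ordering: touch the mode only after the attribute write succeeded. */
static int demo_set_acl(struct demo_inode *inode, unsigned int new_mode,
                        bool out_of_space)
{
    int err = demo_store_acl(inode, out_of_space);

    if (err)
        return err;        /* mode is left exactly as it was */
    inode->mode = new_mode;
    return 0;
}
```

    If the mode were updated first, a failed demo_store_acl() would leave
    the new group bits in place without any ACL to give them their mask
    meaning, which is the situation the commit message warns about.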
     

17 Jul, 2017

2 commits

  • When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID bit on
    'DIR1' being cleared if the user is not a member of the owning group.

    Fix the problem by creating __ext2_set_acl() function that does not call
    posix_acl_update_mode() and use it when inheriting ACLs. That prevents
    SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: stable@vger.kernel.org
    CC: linux-ext4@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
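
    The last semantic patch exists because a masked flag value and a bool
    do not compare the way one might expect. The sketch below uses a
    hypothetical flag in a higher bit (DEMO_RDONLY) so that the mismatch is
    visible; struct demo_sb and the demo_* functions are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag living in a high bit, chosen so the mismatch is visible. */
#define DEMO_RDONLY 0x10UL

struct demo_sb {
    unsigned long s_flags;
};

/* Boolean helper in the spirit of sb_rdonly(): true when the flag is set. */
static bool demo_sb_rdonly(const struct demo_sb *sb)
{
    return (sb->s_flags & DEMO_RDONLY) != 0;
}

/* Naive comparison: a raw masked value (0 or 0x10) against a bool (0 or 1). */
static bool demo_state_differs_naive(unsigned long flags,
                                     const struct demo_sb *sb)
{
    return (flags & DEMO_RDONLY) != demo_sb_rdonly(sb);
}

/* Normalised comparison: both sides are reduced to 0 or 1 first. */
static bool demo_state_differs(unsigned long flags, const struct demo_sb *sb)
{
    return (bool)(flags & DEMO_RDONLY) != demo_sb_rdonly(sb);
}
```

    When both the requested flags and the superblock have the flag set,
    the naive version compares 0x10 with 1 and wrongly reports a
    difference, while the version with the (bool) cast reports none.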
     

14 Jul, 2017

1 commit

  • Pull ext2, udf, reiserfs fixes from Jan Kara:
    "Several ext2, udf, and reiserfs fixes"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext2: Fix memory leak when truncate races ext2_get_blocks
    reiserfs: fix race in prealloc discard
    reiserfs: don't preallocate blocks for extended attributes
    udf: Convert udf_disk_stamp_to_time() to use mktime64()
    udf: Use time64_to_tm for timestamp conversion
    udf: Fix deadlock between writeback and udf_setsize()
    udf: Use i_size_read() in udf_adinicb_writepage()
    udf: Fix races with i_size changes during readpage
    udf: Remove unused UDF_DEFAULT_BLOCKSIZE

    Linus Torvalds
     

13 Jul, 2017

1 commit

  • Buffer heads referencing indirect blocks may not be released if the file
    is truncated at the right time. This happens because ext2_get_branch()
    returns NULL when it finds the whole chain of indirect blocks already
    set, and when truncate alters the chain this value of NULL is
    treated as the address of the last head to be released. Handle this in the
    same way as it's done after the got_it label.

    Signed-off-by: Ernesto A. Fernández
    Signed-off-by: Jan Kara

    Ernesto A. Fernández
     

10 Jul, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The first major feature for ext4 this merge window is the largedir
    feature, which allows ext4 directories to support over 2 billion
    directory entries (assuming ~64 byte file names; in practice, users
    will run into practical performance limits first.) This feature was
    originally written by the Lustre team, and credit goes to Artem
    Blagodarenko from Seagate for getting this feature upstream.

    The second major feature allows ext4 to support extended
    attribute values up to 64k. This feature was also originally from
    Lustre, and has been enhanced by Tahsin Erdogan from Google with a
    deduplication feature so that if multiple files have the same xattr
    value (for example, Windows ACL's stored by Samba), only one copy will
    be stored on disk for encoding and caching efficiency.

    We also have the usual set of bug fixes, cleanups, and optimizations"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
    ext4: fix spelling mistake: "prellocated" -> "preallocated"
    ext4: fix __ext4_new_inode() journal credits calculation
    ext4: skip ext4_init_security() and encryption on ea_inodes
    fs: generic_block_bmap(): initialize all of the fields in the temp bh
    ext4: change fast symlink test to not rely on i_blocks
    ext4: require key for truncate(2) of encrypted file
    ext4: don't bother checking for encryption key in ->mmap()
    ext4: check return value of kstrtoull correctly in reserved_clusters_store
    ext4: fix off-by-one fsmap error on 1k block filesystems
    ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
    ext4: return EIO on read error in ext4_find_entry
    ext4: forbid encrypting root directory
    ext4: send parallel discards on commit completions
    ext4: avoid unnecessary stalls in ext4_evict_inode()
    ext4: add nombcache mount option
    ext4: strong binding of xattr inode references
    ext4: eliminate xattr entry e_hash recalculation for removes
    ext4: reserve space for xattr entries/names
    quota: add get_inode_usage callback to transfer multi-inode charges
    ext4: xattr inode deduplication
    ...

    Linus Torvalds
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
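
    The reporting model this pull request aims for (every open file sees a
    writeback error once) can be sketched with a sequence counter. This is
    a deliberately simplified, hypothetical model: the kernel's errseq_t is
    laid out differently, and none of the demo_* names below are kernel
    APIs.

```c
#include <assert.h>
#include <errno.h>

/* Shared per-file-mapping state in this simplified model. */
struct demo_mapping {
    unsigned int seq;   /* bumped on every recorded writeback error */
    int err;            /* most recent error, as a negative errno */
};

/* Per-open-file state: where this file last looked at the sequence. */
struct demo_file {
    unsigned int seen;
};

/* On open, sample the current sequence so older errors are not reported. */
static void demo_open(struct demo_file *f, const struct demo_mapping *m)
{
    f->seen = m->seq;
}

/* Record a writeback error for everyone who has the file open. */
static void demo_set_error(struct demo_mapping *m, int err)
{
    m->err = err;
    m->seq++;
}

/* Returns the error once per open file, then 0 until a new error occurs. */
static int demo_fsync(struct demo_file *f, const struct demo_mapping *m)
{
    if (f->seen == m->seq)
        return 0;
    f->seen = m->seq;
    return m->err;
}
```

    Unlike the AS_EIO/AS_ENOSPC flags, nothing is cleared when one file
    reports the error, so a second file that was open at the time still
    gets it on its own fsync.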
     

06 Jul, 2017

2 commits

  • ext2 currently does a test+clear of the AS_EIO flag, which is
    problematic for some coming changes.

    What we really need to do instead is call filemap_check_errors
    in __generic_file_fsync after syncing out the buffers. That
    will be sufficient for this case, and help other callers detect
    these errors properly as well.

    With that, we don't need to twiddle it in ext2.

    Suggested-by: Jan Kara
    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox

    Jeff Layton
     
  • The callers all set it to 1.

    Also, make it clear that this function will not set any sort of AS_*
    error, and that the caller must do so if necessary. No existing caller
    uses this on normal files, so none of them need it.

    Also, add __must_check here since, in general, the callers need to handle
    an error here in some fashion.

    Link: http://lkml.kernel.org/r/20170525103303.6524-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andrew Morton

    Jeff Layton
     

22 Jun, 2017

2 commits

  • There will be a second mb_cache instance that tracks ea_inodes. Make
    existing names more explicit so that it is clear that they refer to
    xattr block cache.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     
  • Make names more generic so that mbcache usage is not limited to
    block sharing. In a subsequent patch in the series
    ("ext4: xattr inode deduplication"), we start using the mbcache code
    for sharing xattr inodes. With that patch, old mb_cache_entry.e_block
    field could be holding either a block number or an inode number.

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     

14 May, 2017

1 commit

  • Tetsuo reports:

    fs/built-in.o: In function `xfs_file_iomap_end':
    xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
    fs/built-in.o: In function `xfs_file_iomap_begin':
    xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
    make: *** [vmlinux] Error 1
    $ grep DAX .config
    CONFIG_DAX=m
    # CONFIG_DEV_DAX is not set
    # CONFIG_FS_DAX is not set

    When FS_DAX=n we can/must throw away the dax code in filesystems.
    Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
    nops in the FS_DAX=n case.

    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Tested-by: Tony Luck
    Fixes: ef51042472f5 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
    Reported-by: Tetsuo Handa
    Signed-off-by: Dan Williams

    Dan Williams
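
    The approach described here, wrappers that compile down to no-ops when
    a config option is off, is a common C idiom. The sketch below shows the
    shape only: DEMO_CONFIG_FS_DAX and every demo_* name are hypothetical,
    and the real helpers and their signatures are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dax_device;   /* opaque, hypothetical */

#ifdef DEMO_CONFIG_FS_DAX
/* Real implementations would live in the (optional) core. */
struct demo_dax_device *demo_dax_get_by_host(const char *host);
void demo_put_dax(struct demo_dax_device *dev);

static inline struct demo_dax_device *demo_fs_dax_get_by_host(const char *host)
{
    return demo_dax_get_by_host(host);
}

static inline void demo_fs_put_dax(struct demo_dax_device *dev)
{
    demo_put_dax(dev);
}
#else
/* Feature disabled: stubs that reference nothing outside this file. */
static inline struct demo_dax_device *demo_fs_dax_get_by_host(const char *host)
{
    (void)host;
    return NULL;
}

static inline void demo_fs_put_dax(struct demo_dax_device *dev)
{
    (void)dev;
}
#endif
```

    Because the disabled branch never names the core functions, callers
    link cleanly even when the core is built as a module or left out,
    which is what the undefined-reference errors above were missing.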
     

13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

09 May, 2017

1 commit

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and move the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "
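
    The 'struct dax_device' / 'struct dax_operations' item in the change
    summary above describes drivers publishing their own operations. A
    minimal userspace sketch of that pattern follows. It is not the kernel's
    definition: every name prefixed with demo_ or DEMO_ is hypothetical, and
    the RAM-backed "driver" exists only to make the example self-contained.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL

struct demo_dax_device;

/* Hypothetical per-driver operations table (not the kernel's layout). */
struct demo_dax_ops {
    /*
     * Translate a page offset into an address. Returns the number of
     * bytes available from that offset, or -1 if the offset is out of
     * range.
     */
    long (*direct_access)(struct demo_dax_device *dev, unsigned long pgoff,
                          void **kaddr);
};

/* Hypothetical device: an ops pointer plus driver-specific state. */
struct demo_dax_device {
    const struct demo_dax_ops *ops;
    unsigned char *base;
    unsigned long nr_pages;
};

/* One possible driver: a plain memory buffer. */
static long demo_ram_direct_access(struct demo_dax_device *dev,
                                   unsigned long pgoff, void **kaddr)
{
    if (pgoff >= dev->nr_pages)
        return -1;
    *kaddr = dev->base + pgoff * DEMO_PAGE_SIZE;
    return (long)((dev->nr_pages - pgoff) * DEMO_PAGE_SIZE);
}

const struct demo_dax_ops demo_ram_ops = {
    .direct_access = demo_ram_direct_access,
};

/* Generic entry point: callers never need to know which driver is behind it. */
long demo_dax_direct_access(struct demo_dax_device *dev, unsigned long pgoff,
                            void **kaddr)
{
    return dev->ops->direct_access(dev, pgoff, kaddr);
}
```

    A different backend would supply its own demo_dax_ops instance, and the
    generic entry point would stay unchanged.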

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

26 Apr, 2017

1 commit


19 Apr, 2017

2 commits

  • Now that all places setting inode->i_flags that should be reflected in
    on-disk flags are gone, we can remove the ext2_get_inode_flags() call.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently immutable and noatime flags on quota files are set by quota
    code which requires us to copy inode->i_flags to our on disk version of
    quota flags in GETFLAGS ioctl and __ext2_write_inode(). Move to setting
    / clearing these on-disk flags directly to save that copying.

    Signed-off-by: Jan Kara

    Jan Kara
     

05 Apr, 2017

1 commit

  • ext2_sync_fs() could be called without s_umount semaphore held when
    called through ext2_write_super() from __ext2_write_inode(). This
    function then calls dquot_writeback_dquots() which relies on s_umount to
    be held for protection against other quota operations.

    In fact __ext2_write_inode() does not need all the functionality
    ext2_write_super() provides. It is enough to just write the superblock.
    So use ext2_sync_super() instead.

    Fixes: 9d1ccbe70e0b14545caad12dc73adb3605447df0
    Reported-by: Jan Beulich
    Signed-off-by: Jan Kara

    Jan Kara
     

02 Mar, 2017

1 commit


25 Feb, 2017

3 commits

  • Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
    been somewhat painful to get the flags set and cleared at the correct
    locations. More than one kernel oops was introduced because the
    placement was hard to get right.

    Remove the flag values and introduce an input parameter to huge_fault
    that indicates the size of the page entry. This makes the code easier
    to trace and should avoid the issues we see with the fault flags where
    removal of the flag was necessary in the fallback paths.
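
    As a rough illustration of this interface change, the sketch below
    passes the entry size as an explicit argument instead of encoding it in
    fault flags. It is a standalone userspace model, not kernel code: the
    demo_ and DEMO_ names, the result codes, and the pud_supported switch
    are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the "size of the page entry" parameter. */
enum demo_entry_size { DEMO_SIZE_PTE, DEMO_SIZE_PMD, DEMO_SIZE_PUD };

/* Hypothetical result codes: handled, or fall back to a smaller size. */
enum demo_fault_result { DEMO_FAULT_DONE, DEMO_FAULT_FALLBACK };

/*
 * Model handler: the size arrives as an argument, so there is no flag
 * to set before the call or to clear on the fallback path.
 */
enum demo_fault_result demo_huge_fault(enum demo_entry_size size,
                                       bool pud_supported)
{
    switch (size) {
    case DEMO_SIZE_PTE:
    case DEMO_SIZE_PMD:
        return DEMO_FAULT_DONE;
    case DEMO_SIZE_PUD:
        return pud_supported ? DEMO_FAULT_DONE : DEMO_FAULT_FALLBACK;
    }
    return DEMO_FAULT_FALLBACK;
}

/* Caller retries with the next smaller size; no shared state to undo. */
enum demo_entry_size demo_fault_with_fallback(bool pud_supported)
{
    enum demo_entry_size size = DEMO_SIZE_PUD;

    while (demo_huge_fault(size, pud_supported) == DEMO_FAULT_FALLBACK &&
           size > DEMO_SIZE_PTE)
        size = (enum demo_entry_size)(size - 1);
    return size;
}
```

    Because the size is a plain argument, the fallback loop only changes a
    local variable before retrying.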

    Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     
  • Patch series "1G transparent hugepage support for device dax", v2.

    The following series implements support for 1G transparent hugepage on
    x86 for device dax. The bulk of the code was written by Matthew Wilcox a
    while back supporting transparent 1G hugepage for fs DAX. I have
    forward ported the relevant bits to 4.10-rc. The current submission has
    only the necessary code to support device DAX.

    Comments from Dan Williams: So the motivation and intended user of this
    functionality mirrors the motivation and users of 1GB page support in
    hugetlbfs. Given expected capacities of persistent memory devices an
    in-memory database may want to reduce TLB pressure beyond what it can
    already achieve with 2MB mappings of a device-dax file. We have
    customer feedback to that effect as Willy mentioned in his previous
    version of these patches [1].

    [1]: https://lkml.org/lkml/2016/1/31/52

    Comments from Nilesh @ Oracle:

    Some applications use a process model. If you assume 10,000 processes,
    each attempting to mmap all of the 6TB of memory available on a server,
    we are looking at the following:

    processes : 10,000
    memory : 6TB
    pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
    pmd @ 2M page size: 120,000 / 512 = ~240GB
    pud @ 1G page size: 240GB / 512 = ~480MB

    As you can see, with 2M pages this system will use up an exorbitant
    amount of DRAM to hold the page tables, but the 1G pages finally bring
    it down to a reasonable level. Memory sizes will keep increasing, so
    this number will keep increasing.

    An argument can be made to convert the applications from process model
    to thread model, but in the real world that may not always be practical.
    Hopefully this helps explain the use case where this is valuable.
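
    The arithmetic in the estimate above can be reproduced with a short
    calculation. This is only a sketch: the helper name leaf_table_bytes is
    made up for illustration, it counts only the lowest-level entries at 8
    bytes each and ignores the upper page-table levels, and it uses binary
    units, so the 2M and 1G results land slightly below the rounded figures
    quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (UINT64_C(1) << 10)
#define MIB (UINT64_C(1) << 20)
#define GIB (UINT64_C(1) << 30)
#define TIB (UINT64_C(1) << 40)

/*
 * Hypothetical helper (not a kernel function): bytes of leaf page-table
 * entries needed when `nprocs` processes each map `mem_bytes` of memory
 * with pages of `page_bytes`, at 8 bytes per entry.
 */
uint64_t leaf_table_bytes(uint64_t mem_bytes, uint64_t page_bytes,
                          uint64_t nprocs)
{
    assert(page_bytes != 0);
    return (mem_bytes / page_bytes) * 8 * nprocs;
}
```

    With 6 TiB and 10,000 processes, leaf_table_bytes(6 * TIB, 4 * KIB, 10000)
    gives 120,000 GiB, the 2 MiB case gives 240,000 MiB (about 234 GiB), and
    the 1 GiB case gives 480,000 KiB (about 469 MiB).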

    This patch (of 3):

    In preparation for adding the ability to handle PUD pages, convert
    vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
    vm_fault structure is extended to include a union of the different page
    table pointers that may be needed, and three flag bits are reserved to
    indicate which type of pointer is in the union.
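
    The layout described here, a union of page table pointers plus flag
    bits that say which member is valid, can be modelled in a few lines.
    The sketch below is a userspace illustration under stated assumptions:
    the demo_ types, the DEMO_FLAG_ bit values, and the accessor are
    hypothetical and do not reproduce the kernel's vm_fault definition.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three page-table entry types. */
struct demo_pte { unsigned long val; };
struct demo_pmd { unsigned long val; };
struct demo_pud { unsigned long val; };

/* Three hypothetical flag bits recording which union member is in use. */
#define DEMO_FLAG_PTE (1u << 0)
#define DEMO_FLAG_PMD (1u << 1)
#define DEMO_FLAG_PUD (1u << 2)
#define DEMO_FLAG_SIZE_MASK (DEMO_FLAG_PTE | DEMO_FLAG_PMD | DEMO_FLAG_PUD)

/* Model fault descriptor: one slot shared by all pointer types. */
struct demo_fault {
    unsigned int flags;
    union {
        struct demo_pte *pte;
        struct demo_pmd *pmd;
        struct demo_pud *pud;
    } entry;
};

/* Return the active pointer, or NULL when no single size flag is set. */
void *demo_fault_entry(const struct demo_fault *f)
{
    switch (f->flags & DEMO_FLAG_SIZE_MASK) {
    case DEMO_FLAG_PTE:
        return f->entry.pte;
    case DEMO_FLAG_PMD:
        return f->entry.pmd;
    case DEMO_FLAG_PUD:
        return f->entry.pud;
    default:
        return NULL;
    }
}
```

    The flag bits and the union have to be kept in sync by every caller,
    which makes a stale flag an easy mistake in a design like this.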

    [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
    Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
    [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
    Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
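
    A small model of the simplification: once the fault descriptor carries
    a pointer to its vma, a handler needs only the descriptor. The types
    and function names below (demo_area, demo_vmf, demo_fault_old,
    demo_fault_new) are hypothetical and exist only for this sketch.

```c
/* Hypothetical stand-in for a memory area. */
struct demo_area {
    unsigned long start;
    unsigned long end;
};

/* Hypothetical fault descriptor that already points at its area. */
struct demo_vmf {
    struct demo_area *vma;
    unsigned long address;
};

/* Old shape: the area is passed twice, once directly and once via vmf. */
int demo_fault_old(struct demo_area *vma, struct demo_vmf *vmf)
{
    return vmf->address >= vma->start && vmf->address < vma->end;
}

/* New shape: a single argument; the area is read from the descriptor. */
int demo_fault_new(struct demo_vmf *vmf)
{
    struct demo_area *vma = vmf->vma;

    return vmf->address >= vma->start && vmf->address < vma->end;
}
```

    With one argument there is no way for a caller to pass an area that
    disagrees with the one stored in the descriptor.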

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

31 Jan, 2017

1 commit


25 Jan, 2017

1 commit

  • As reported by Arnd:

    https://lkml.org/lkml/2017/1/10/756

    Compiling with the following configuration:

    # CONFIG_EXT2_FS is not set
    # CONFIG_EXT4_FS is not set
    # CONFIG_XFS_FS is not set
    # CONFIG_FS_IOMAP depends on the above filesystems, so is not set
    CONFIG_FS_DAX=y

    generates build warnings about unused functions in fs/dax.c:

    fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function]
    static int dax_insert_mapping(struct address_space *mapping,
    ^~~~~~~~~~~~~~~~~~
    fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function]
    static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
    ^~~~~~~~~~~~~
    fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function]
    static int dax_load_hole(struct address_space *mapping, void **entry,
    ^~~~~~~~~~~~~
    fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function]
    static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
    ^~~~~~~~~~~~~~~~~~

    Now that the struct buffer_head based DAX fault paths and I/O path have
    been removed we really depend on iomap support being present for DAX.
    Make this explicit by selecting FS_IOMAP if we compile in DAX support.

    This allows us to remove conditional selections of FS_IOMAP when FS_DAX
    was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c.
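
    The warnings quoted above follow a common pattern: a static helper
    whose only callers are compiled out. The sketch below reproduces that
    pattern in isolation. DEMO_HAVE_IOMAP and both function names are
    hypothetical stand-ins, not real Kconfig symbols or fs/dax.c functions.

```c
/*
 * Hypothetical stand-in for a config symbol. Set it to 0 and build with
 * `gcc -Wall` (which enables -Wunused-function) to see a
 * "defined but not used" warning for demo_insert_mapping().
 */
#define DEMO_HAVE_IOMAP 1

/* Static helper: only referenced from the conditional code below. */
static int demo_insert_mapping(int index)
{
    return index + 1;
}

#if DEMO_HAVE_IOMAP
/* The sole caller. Compiling it out leaves the helper unused. */
int demo_iomap_fault(int index)
{
    return demo_insert_mapping(index);
}
#endif
```

    Making the dependency unconditional, as this commit does by selecting
    FS_IOMAP whenever DAX is built, means the caller is always compiled and
    the helper is always used.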

    Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

27 Dec, 2016

1 commit

  • So far we did not return BH_New buffers from ext2_get_blocks() when we
    allocated and zeroed-out a block for DAX inode to avoid racy zeroing in
    DAX code. This zeroing is gone these days so we can remove the
    workaround.

    Reviewed-by: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

25 Dec, 2016

1 commit


20 Dec, 2016

1 commit

  • Pull quota, fsnotify and ext2 updates from Jan Kara:
    "Changes to locking of some quota operations from dedicated quota mutex
    to s_umount semaphore, a fsnotify fix and a simple ext2 fix"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    quota: Fix bogus warning in dquot_disable()
    fsnotify: Fix possible use-after-free in inode iteration on umount
    ext2: reject inodes with negative size
    quota: Remove dqonoff_mutex
    ocfs2: Use s_umount for quota recovery protection
    quota: Remove dqonoff_mutex from dquot_scan_active()
    ocfs2: Protect periodic quota syncing with s_umount semaphore
    quota: Use s_umount protection for quota operations
    quota: Hold s_umount in exclusive mode when enabling / disabling quotas
    fs: Provide function to get superblock with exclusive s_umount

    Linus Torvalds
     

18 Dec, 2016

1 commit

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds
     

15 Dec, 2016

1 commit

  • Pull fs meta data unmap optimization from Jens Axboe:
    "A series from Jan Kara, providing a more efficient way for unmapping
    meta data in the buffer cache than doing it block-by-block.

    Provide a general helper that existing callers can use"

    * 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block:
    fs: Remove unmap_underlying_metadata
    fs: Add helper to clean bdev aliases under a bh and use it
    ext2: Use clean_bdev_aliases() instead of iteration
    ext4: Use clean_bdev_aliases() instead of iteration
    direct-io: Use clean_bdev_aliases() instead of handmade iteration
    fs: Provide function to unmap metadata for a range of blocks

    Linus Torvalds