11 Oct, 2016

1 commit

  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     

08 Oct, 2016

3 commits

  • Merge updates from Andrew Morton:

    - fsnotify updates

    - ocfs2 updates

    - all of MM

    * emailed patches from Andrew Morton : (127 commits)
    console: don't prefer first registered if DT specifies stdout-path
    cred: simpler, 1D supplementary groups
    CREDITS: update Pavel's information, add GPG key, remove snail mail address
    mailmap: add Johan Hovold
    .gitattributes: set git diff driver for C source code files
    uprobes: remove function declarations from arch/{mips,s390}
    spelling.txt: "modeled" is spelt correctly
    nmi_backtrace: generate one-line reports for idle cpus
    arch/tile: adopt the new nmi_backtrace framework
    nmi_backtrace: do a local dump_stack() instead of a self-NMI
    nmi_backtrace: add more trigger_*_cpu_backtrace() methods
    min/max: remove sparse warnings when they're nested
    Documentation/filesystems/proc.txt: add more description for maps/smaps
    mm, proc: fix region lost in /proc/self/smaps
    proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
    proc: add LSM hook checks to /proc//timerslack_ns
    proc: relax /proc//timerslack_ns capability requirements
    meminfo: break apart a very long seq_printf with #ifdefs
    seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
    proc: faster /proc/*/status
    ...

    Linus Torvalds
     
  • These inode operations are no longer used; remove them.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • To support DAX pmd mappings with unmodified applications, filesystems
    need to align an mmap address by the pmd size.

    Call thp_get_unmapped_area() from f_op->get_unmapped_area.

    Note, there is no change in behavior for a non-DAX file.

    Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

30 Sep, 2016

2 commits


15 Sep, 2016

1 commit


27 Jul, 2016

2 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton : (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
  • Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
    removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
    dax_pmd_fault() respectively, and update all callers.

    The dax_fault() and dax_pmd_fault() wrappers were initially intended to
    capture some filesystem independent functionality around page faults
    (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
    and ctime).

    However, the following commits:

    5726b27b09cc ("ext2: Add locking for DAX faults")
    ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

    added locking to the ext2 and ext4 filesystems after these common
    operations but before __dax_fault() and __dax_pmd_fault() were called.
    This means that these wrappers are no longer used, and are unlikely to
    be used in the future.

    XFS has had locking analogous to what was recently added to ext2 and
    ext4 since DAX support was initially introduced by:

    6b698edeeef0 ("xfs: add DAX file operations support")

    Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

11 Jul, 2016

1 commit


27 May, 2016

1 commit

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

25 May, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and a maliciously corrupted file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsisteny check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

17 May, 2016

1 commit

  • Fault handlers currently take complete_unwritten argument to convert
    unwritten extents after PTEs are updated. However no filesystem uses
    this anymore as the code is racy. Remove the unused argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     

13 May, 2016

1 commit

  • Currently ext4 treats DAX IO the same way as direct IO. I.e., it
    allocates unwritten extents before IO is done and converts unwritten
    extents afterwards. However this way DAX IO can race with page fault to
    the same area:

    ext4_ext_direct_IO() dax_fault()
    dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
    get_block() - converts
    unwritten block to
    written and zeroes it
    out
    ext4_convert_unwritten_extents()

    So data written with DAX IO gets lost. Similarly dax_new_buf() called
    from dax_io() can overwrite data that has been already written to the
    block via mmap.

    Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
    use them for DAX mmap. The downside of this solution is that every
    allocating write writes each block twice (once zeros, once data). Fixing
    the race with locking is possible as well however we would need to
    lock-out faults for the whole range written to by DAX IO. And that is
    not easy to do without locking-out faults for the whole file which seems
    too aggressive.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

02 May, 2016

2 commits

  • The kiocb already has the new position, so use that. The only interesting
    case is AIO, where we currently don't bother updating ki_pos. We're about
    to free the kiocb after we're done, so we might as well update it to make
    everyone's life simpler.

    While we're at it also return the bytes written argument passed in if
    we were successful so that the boilerplate error switch code in the
    callers can go away.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • This will allow us to do per-I/O sync file writes, as required by a lot
    of fileservers or storage targets.

    XXX: Will need a few additional audits for O_DSYNC

    Signed-off-by: Al Viro

    Christoph Hellwig
     

27 Apr, 2016

1 commit


08 Apr, 2016

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "These changes contains a fix for overlayfs interacting with some
    (badly behaved) dentry code in various file systems. These have been
    reviewed by Al and the respective file system mtinainers and are going
    through the ext4 tree for convenience.

    This also has a few ext4 encryption bug fixes that were discovered in
    Android testing (yes, we will need to get these sync'ed up with the
    fs/crypto code; I'll take care of that). It also has some bug fixes
    and a change to ignore the legacy quota options to allow for xfstests
    regression testing of ext4's internal quota feature and to be more
    consistent with how xfs handles this case"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: ignore quota mount options if the quota feature is enabled
    ext4 crypto: fix some error handling
    ext4: avoid calling dquot_get_next_id() if quota is not enabled
    ext4: retry block allocation for failed DIO and DAX writes
    ext4: add lockdep annotations for i_data_sem
    ext4: allow readdir()'s of large empty directories to be interrupted
    btrfs: fix crash/invalid memory access on fsync when using overlayfs
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4: use file_dentry()
    ext4: use dget_parent() in ext4_file_open()
    nfs: use file_dentry()
    fs: add file_dentry()
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()

    Linus Torvalds
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 Mar, 2016

2 commits

  • EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry
    can lead to a crash.

    Fix by replacing direct access of file->f_path.dentry with the
    file_dentry() accessor, which will always return a native object.

    Reported-by: Daniel Axtens
    Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
    Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()")
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Theodore Ts'o
    Cc: David Howells
    Cc: Al Viro
    Cc: # v4.5

    Miklos Szeredi
     
  • In f_op->open() lock on parent is not held, so there's no guarantee that
    parent dentry won't go away at any time.

    Even after this patch there's no guarantee that 'dir' will stay the parent
    of 'inode', but at least it won't be freed while being used.

    Fixes: ff978b09f973 ("ext4 crypto: move context consistency check to ext4_file_open()")
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Theodore Ts'o
    Cc: # v4.5

    Miklos Szeredi
     

18 Mar, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Performance improvements in SEEK_DATA and xattr scalability
    improvements, plus a lot of clean ups and bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
    ext4: clean up error handling in the MMP support
    jbd2: do not fail journal because of frozen_buffer allocation failure
    ext4: use __GFP_NOFAIL in ext4_free_blocks()
    ext4: fix compile error while opening the macro DOUBLE_CHECK
    ext4: print ext4 mount option data_err=abort correctly
    ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
    ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
    ext4: fix misspellings in comments.
    jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
    ext4: more efficient SEEK_DATA implementation
    ext4: cleanup handling of bh->b_state in DAX mmap
    ext4: return hole from ext4_map_blocks()
    ext4: factor out determining of hole size
    ext4: fix setting of referenced bit in ext4_es_lookup_extent()
    ext4: remove i_ioend_count
    ext4: simplify io_end handling for AIO DIO
    ext4: move trans handling and completion deferal out of _ext4_get_block
    ext4: rename and split get blocks functions
    ext4: use i_mutex to serialize unaligned AIO DIO
    ext4: pack ioend structure better
    ...

    Linus Torvalds
     

10 Mar, 2016

1 commit

  • Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
    ext4_seek_data() iterates hole block-by-block. Fix the problem by using
    returned hole size from ext4_map_blocks() and thus skip the hole in one
    go.

    Update also SEEK_HOLE implementation to follow the same pattern as
    SEEK_DATA to make future maintenance easier.

    Furthermore we add cond_resched() to both ext4_seek_data() and
    ext4_seek_hole() to avoid softlockups in case evil user creates huge
    fragmented file and we have to go through lots of extents.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

09 Mar, 2016

1 commit

  • Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
    However the code cleanups that happened after 2011 when the lock was
    introduced made aio_mutex acquired at almost the same places where we
    already have exclusion using i_mutex. So just use i_mutex for the
    exclusion of unaligned AIO DIO.

    The change moves waiting for pending unwritten extent conversion under
    i_mutex. That makes special handling of O_APPEND writes unnecessary and
    also avoids possible livelocking of unaligned AIO DIO with aligned one
    (nothing was preventing contiguous stream of aligned AIO DIOs to let
    unaligned AIO DIO wait forever).

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

28 Feb, 2016

1 commit

  • As it is currently written ext4_dax_mkwrite() assumes that the call into
    __dax_mkwrite() will not have to do a block allocation so it doesn't create
    a journal entry. For a read that creates a zero page to cover a hole
    followed by a write that actually allocates storage this is incorrect. The
    ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
    get_blocks() to allocate storage.

    Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
    as this function already has all the logic needed to allocate a journal
    entry and call __dax_fault().

    Also update the ext2 fault handlers in this same way to remove duplicate
    code and keep the logic between ext2 and ext4 the same.

    Reviewed-by: Jan Kara
    Signed-off-by: Ross Zwisler
    Signed-off-by: Theodore Ts'o

    Ross Zwisler
     

08 Feb, 2016

1 commit

  • In the case where the per-file key for the directory is cached, but
    root does not have access to the key needed to derive the per-file key
    for the files in the directory, we allow the lookup to succeed, so
    that lstat(2) and unlink(2) can suceed. However, if a program tries
    to open the file, it will get an ENOKEY error.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

2 commits

  • To properly support the new DAX fsync/msync infrastructure filesystems
    need to call dax_pfn_mkwrite() so that DAX can track when user pages are
    dirtied.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

08 Dec, 2015

2 commits

  • Make DAX fault path use pre-zeroed blocks to avoid races with extent
    conversion and zeroing when two page faults to the same block happen.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently, page faults and hole punching are completely unsynchronized.
    This can result in page fault faulting in a page into a range that we
    are punching after truncate_pagecache_range() has been called and thus
    we can end up with a page mapped to disk blocks that will be shortly
    freed. Filesystem corruption will shortly follow. Note that the same
    race is avoided for truncate by checking page fault offset against
    i_size but there isn't similar mechanism available for punching holes.

    Fix the problem by creating new rw semaphore i_mmap_sem in inode and
    grab it for writing over truncate, hole punching, and other functions
    removing blocks from extent tree and for read over page faults. We
    cannot easily use i_data_sem for this since that ranks below transaction
    start and we need something ranking above it so that it can be held over
    the whole truncate / hole punching operation. Also remove various
    workarounds we had in the code to reduce race window when page fault
    could have created pages with stale mapping information.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

09 Sep, 2015

5 commits

  • Jan Kara pointed out that in the case where we are writing to a hole, we
    can end up with a lock inversion between the page lock and the journal
    lock. We can avoid this by starting the transaction in ext4 before
    calling into DAX. The journal lock nests inside the superblock
    pagefault lock, so we have to duplicate that code from dax_fault, like
    XFS does.

    Signed-off-by: Matthew Wilcox
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX wants different semantics from any currently-existing ext4 get_block
    callback. Unlike ext4_get_block_write(), it needs to honour the
    'create' flag, and unlike ext4_get_block(), it needs to be able to
    return unwritten extents. So introduce a new ext4_get_block_dax() which
    has those semantics.

    We could also change ext4_get_block_write() to honour the 'create' flag,
    but that might have consequences on other users that I do not currently
    understand.

    Signed-off-by: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX relies on the get_block function either zeroing newly allocated
    blocks before they're findable by subsequent calls to get_block, or
    marking newly allocated blocks as unwritten. ext4_get_block() cannot
    create unwritten extents, but ext4_get_block_write() can.

    Signed-off-by: Matthew Wilcox
    Reported-by: Andy Rudoff
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
    return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
    defined in linux/mm.h. Given that we don't want to include
    in , the easiest solution is to move the DAX-related
    functions to a new header, . We could also have moved
    VM_FAULT_* definitions to a new header, or a different header that isn't
    quite such a boil-the-ocean header as , but this felt like
    the best option.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

01 Jul, 2015

1 commit

  • Pul xfs updates from Dave Chinner:
    "There's a couple of small API changes to the core DAX code which
    required small changes to the ext2 and ext4 code bases, but otherwise
    everything is within the XFS codebase.

    This update contains:

    - A new sparse on-disk inode record format to allow small extents to
    be used for inode allocation when free space is fragmented.

    - DAX support. This includes minor changes to the DAX core code to
    fix problems with lock ordering and bufferhead mapping abuse.

    - transaction commit interface cleanup

    - removal of various unnecessary XFS specific type definitions

    - cleanup and optimisation of freelist preparation before allocation

    - various minor cleanups

    - bug fixes for
    - transaction reservation leaks
    - incorrect inode logging in unwritten extent conversion
    - mmap lock vs freeze ordering
    - remote symlink mishandling
    - attribute fork removal issues"

    * tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
    xfs: don't truncate attribute extents if no extents exist
    xfs: clean up XFS_MIN_FREELIST macros
    xfs: sanitise error handling in xfs_alloc_fix_freelist
    xfs: factor out free space extent length check
    xfs: xfs_alloc_fix_freelist() can use incore perag structures
    xfs: remove xfs_caddr_t
    xfs: use void pointers in log validation helpers
    xfs: return a void pointer from xfs_buf_offset
    xfs: remove inst_t
    xfs: remove __psint_t and __psunsigned_t
    xfs: fix remote symlinks on V5/CRC filesystems
    xfs: fix xfs_log_done interface
    xfs: saner xfs_trans_commit interface
    xfs: remove the flags argument to xfs_trans_cancel
    xfs: pass a boolean flag to xfs_trans_free_items
    xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
    xfs: check min blks for random debug mode sparse allocations
    xfs: fix sparse inodes 32-bit compile failure
    xfs: add initial DAX support
    xfs: add DAX IO path support
    ...

    Linus Torvalds
     

04 Jun, 2015

1 commit

  • dax_fault() currently relies on the get_block callback to attach an
    io completion callback to the mapping buffer head so that it can
    run unwritten extent conversion after zeroing allocated blocks.

    Instead of this hack, pass the conversion callback directly into
    dax_fault() similar to the get_block callback. When the filesystem
    allocates unwritten extents, it will set the buffer_unwritten()
    flag, and hence the dax_fault code can call the completion function
    in the contexts where it is necessary without overloading the
    mapping buffer head.

    Note: The changes to ext4 to use this interface are suspect at best.
    In fact, the way ext4 did this end_io assignment in the first place
    looks suspect because it only set a completion callback when there
    wasn't already some other write() call taking place on the same
    inode. The ext4 end_io code looks rather intricate and fragile with
    all it's reference counting and passing to different contexts for
    modification via inode private pointers that aren't protected by
    locks...

    Signed-off-by: Dave Chinner
    Acked-by: Jan Kara
    Signed-off-by: Dave Chinner

    Dave Chinner
     

01 Jun, 2015

1 commit


19 May, 2015

1 commit

  • This is a pretty massive patch which does a number of different things:

    1) The per-inode encryption information is now stored in an allocated
    data structure, ext4_crypt_info, instead of directly in the node.
    This reduces the size usage of an in-memory inode when it is not
    using encryption.

    2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
    encryption structure instead. This remove an unnecessary memory
    allocation and free for the fname_crypto_ctx as well as allowing us
    to reuse the ctfm in a directory for multiple lookups and file
    creations.

    3) We also cache the inode's policy information in the ext4_crypt_info
    structure so we don't have to continually read it out of the
    extended attributes.

    4) We now keep the keyring key in the inode's encryption structure
    instead of releasing it after we are done using it to derive the
    per-inode key. This allows us to test to see if the key has been
    revoked; if it has, we prevent the use of the derived key and free
    it.

    5) When an inode is released (or when the derived key is freed), we
    will use memset_explicit() to zero out the derived key, so it's not
    left hanging around in memory. This implies that when a user logs
    out, it is important to first revoke the key, and then unlink it,
    and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
    release any decrypted pages and dcache entries from the system
    caches.

    6) All this, and we also shrink the number of lines of code by around
    100. :-)

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o