11 May, 2019

1 commit

  • If creating the jbd2_inode_cache cache fails, the previously created
    jbd2_handle_cache cache is destroyed twice. This patch fixes this by
    moving each cache's initialization and destruction into its own
    separate function.

    Signed-off-by: Chengguang Xu
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Chengguang Xu
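    The pattern can be sketched in userspace C, with malloc()/free() standing
    in for kmem_cache_create()/kmem_cache_destroy(). All names here
    (handle_cache, inode_cache, caches_init() and so on) are hypothetical and
    are not the jbd2 functions. Each cache has its own init/exit pair, each
    exit resets its pointer to NULL, and the error path unwinds only what was
    actually created, so nothing is destroyed twice.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for jbd2_handle_cache and jbd2_inode_cache. */
void *handle_cache;
void *inode_cache;
int fail_inode_cache;	/* test knob: simulate a cache creation failure */

int handle_cache_init(void)
{
	handle_cache = malloc(64);
	return handle_cache ? 0 : -1;
}

void handle_cache_exit(void)
{
	free(handle_cache);	/* free(NULL) is a no-op, like kmem_cache_destroy(NULL) */
	handle_cache = NULL;	/* so a second call is harmless */
}

int inode_cache_init(void)
{
	inode_cache = fail_inode_cache ? NULL : malloc(64);
	return inode_cache ? 0 : -1;
}

void inode_cache_exit(void)
{
	free(inode_cache);
	inode_cache = NULL;
}

/* Create caches in order; on failure, destroy only those already created. */
int caches_init(void)
{
	if (handle_cache_init())
		return -1;
	if (inode_cache_init()) {
		handle_cache_exit();
		return -1;
	}
	return 0;
}

void caches_exit(void)
{
	inode_cache_exit();
	handle_cache_exit();
}
```

    Calling caches_exit() after a failed caches_init() is safe in this
    sketch, because every pointer is already NULL.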
     

07 Apr, 2019

1 commit

  • We hit a BUG at fs/buffer.c:3057 if we detached the nbd device
    before unmounting the ext4 filesystem.

    The typical chain of events leading to the BUG:
    jbd2_write_superblock
    submit_bh
    submit_bh_wbc
    BUG_ON(!buffer_mapped(bh));

    The block device is removed and all of its pages are invalidated, but
    JBD2 was still trying to write the journal superblock to the block
    device, which is no longer present.

    Fix this by checking that the journal superblock's buffer head is
    still mapped before submitting it.

    Reported-by: Eric Ren
    Signed-off-by: Jiufei Xue
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org

    Jiufei Xue
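    A minimal sketch of such a guard, assuming the check is on whether the
    buffer is still mapped (the condition the BUG_ON tests). struct buffer and
    write_superblock() are hypothetical simplifications of struct buffer_head
    and jbd2_write_superblock(); the point is to fail with -EIO instead of
    submitting I/O for a buffer whose device is gone.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct buffer_head. */
struct buffer {
	bool mapped;	/* false once the backing device has gone away */
	int submitted;	/* number of writes submitted */
};

int write_superblock(struct buffer *bh)
{
	/* Return an error instead of reaching BUG_ON(!buffer_mapped(bh)). */
	if (!bh->mapped)
		return -EIO;
	bh->submitted++;	/* stands in for submit_bh() */
	return 0;
}
```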
     

15 Feb, 2019

2 commits

  • The functions jbd2_superblock_csum_verify() and
    jbd2_superblock_csum_set() only get called from one location, so to
    simplify things, fold them into their callers.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • The jbd2 superblock is currently lockless, so there can be a race
    between writing it to disk and modifying its contents, which may lead
    to a checksum error. The following race is one case that we have
    captured.

    jbd2                                fsstress
    jbd2_journal_commit_transaction
     jbd2_journal_update_sb_log_tail
      jbd2_write_superblock
       jbd2_superblock_csum_set         jbd2_journal_revoke
                                         jbd2_journal_set_features(revoke)
                                          modify superblock
       submit_bh(checksum incorrect)

    Fix this by locking the buffer head before modifying it. We always
    write the jbd2 superblock after we modify it, so this just means
    calling lock_buffer() a little earlier.

    This checksum corruption problem can be reproduced by xfstests
    generic/475.

    Reported-by: zhangyi (F)
    Suggested-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
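    The locking rule from the superblock race fix above can be sketched in
    userspace, with a pthread mutex standing in for lock_buffer() and
    unlock_buffer(). struct sb, the toy checksum, and both functions are
    hypothetical. The writer computes the checksum and "submits" the contents
    inside one critical section, and every modifier takes the same lock, so
    the checksum can no longer go stale between setting it and submitting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical superblock: the mutex plays the role of lock_buffer(). */
struct sb {
	pthread_mutex_t lock;
	uint32_t features;
	uint32_t csum;
	uint32_t disk_features;	/* last copy "submitted" to disk */
	uint32_t disk_csum;
};

/* Toy checksum for illustration only (jbd2 uses crc32c). */
uint32_t toy_csum(uint32_t features)
{
	return features * 2654435761u;
}

/* Modifier side, e.g. setting a feature bit from the revoke path. */
void sb_set_feature(struct sb *s, uint32_t bit)
{
	pthread_mutex_lock(&s->lock);
	s->features |= bit;
	pthread_mutex_unlock(&s->lock);
}

/* Writer side: checksum and submit in one critical section. */
void sb_write(struct sb *s)
{
	pthread_mutex_lock(&s->lock);
	s->csum = toy_csum(s->features);
	s->disk_features = s->features;	/* stands in for submit_bh() */
	s->disk_csum = s->csum;
	pthread_mutex_unlock(&s->lock);
}
```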
     

01 Feb, 2019

1 commit

  • This issue was found when I tried to move checkpoint work into a
    separate thread; the following deadlock happened:
    Thread1                                               | Thread2
    __jbd2_log_wait_for_space                             |
    jbd2_log_do_checkpoint (hold j_checkpoint_mutex)      |
      if (jh->b_transaction != NULL)                      |
        ...                                               |
        jbd2_log_start_commit(journal, tid);              | jbd2_update_log_tail
                                                          |   will lock j_checkpoint_mutex,
                                                          |   but will be blocked here.
                                                          |
        jbd2_log_wait_commit(journal, tid);               |
          wait_event(journal->j_wait_done_commit,         |
              !tid_gt(tid, journal->j_commit_sequence));  |
          ...                                             | wake_up(j_wait_done_commit)
      }                                                   |

    Then a deadlock occurs: Thread1 will never be woken up.

    To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
    when we are going to wait for transaction commit.

    Reviewed-by: Jan Kara
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Theodore Ts'o

    Xiaoguang Wang
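    The checkpoint deadlock fix can be modeled deterministically, with a
    boolean standing in for j_checkpoint_mutex and a function call standing in
    for waiting on the commit thread. All names are hypothetical.
    commit_and_update_tail() returns false where the real commit thread would
    block on the mutex; dropping the mutex before the wait lets the commit
    complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a flag stands in for j_checkpoint_mutex. */
struct journal {
	bool checkpoint_mutex_held;
	unsigned commit_sequence;	/* tid of the last finished commit */
	unsigned tail_updates;
};

/*
 * Commit-thread side. Returns false where the real thread would block in
 * jbd2_update_log_tail() because the checkpoint mutex is held.
 */
bool commit_and_update_tail(struct journal *j, unsigned tid)
{
	if (j->checkpoint_mutex_held)
		return false;
	j->tail_updates++;
	j->commit_sequence = tid;	/* then wake_up(j_wait_done_commit) */
	return true;
}

/* Checkpoint side: a buffer is still owned by running transaction tid. */
bool checkpoint_wait_for(struct journal *j, unsigned tid, bool drop_mutex)
{
	bool committed;

	j->checkpoint_mutex_held = true;		/* mutex_lock() */
	if (drop_mutex)
		j->checkpoint_mutex_held = false;	/* the fix: unlock first */
	committed = commit_and_update_tail(j, tid);	/* models the wait */
	/* ... re-take the mutex and rescan the checkpoint list ... */
	j->checkpoint_mutex_held = false;		/* mutex_unlock() */
	return committed;
}
```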
     

21 May, 2018

2 commits

  • The kmem_cache_destroy() function already checks for null pointers, so
    we can remove the check at the call site.

    This patch also sets jbd2_handle_cache and jbd2_inode_cache to NULL
    after freeing them in jbd2_journal_destroy_handle_cache().

    Signed-off-by: Wang Long
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Wang Long
     
  • See the following dmesg output with jbd2 debugging enabled:

    ...(start_this_handle, 313): New handle 00000000c88d6ceb going live.

    ...(start_this_handle, 383): Handle 00000000c88d6ceb given 53 credits (total 53, free 32681)

    ...(do_get_write_access, 838): journal_head 0000000002856fc0, force_copy 0

    ...(jbd2_journal_cancel_revoke, 421): journal_head 0000000002856fc0, cancelling revoke

    Every message is followed by an extra blank line, which wastes log
    buffer space. We could fix this either by removing the "\n" in the
    callers or by removing it in __jbd2_debug(); I checked that every
    jbd2_debug() caller passes '\n' explicitly.

    To avoid touching more lines, remove it inside __jbd2_debug().

    Signed-off-by: Wang Shilong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Wang Shilong
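    A hypothetical userspace stand-in for __jbd2_debug() that formats into a
    buffer instead of the kernel log. It prints the "(function, line): "
    prefix seen in the dmesg output above and appends no newline of its own,
    relying on callers to end their format strings with "\n". debug_msg() is
    an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for __jbd2_debug(): formats into buf rather than the
 * kernel log. No "\n" is appended here; callers already end fmt with one.
 */
int debug_msg(char *buf, size_t len, const char *func, int line,
	      const char *fmt, ...)
{
	va_list args;
	int n, m;

	n = snprintf(buf, len, "(%s, %d): ", func, line);
	if (n < 0 || (size_t)n >= len)
		return n;		/* error or truncated prefix */
	va_start(args, fmt);
	m = vsnprintf(buf + n, len - n, fmt, args);
	va_end(args);
	return m < 0 ? m : n + m;
}
```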
     

20 Feb, 2018

1 commit


19 Feb, 2018

1 commit

  • Previously the jbd2 layer assumed that a file system check would be
    required after a journal abort. In the case of a deliberate file
    system shutdown, this should not be necessary. Allow the jbd2 layer
    to distinguish between these two cases by using the ESHUTDOWN errno.

    Also add proper locking to __journal_abort_soft().

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
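    A minimal sketch of the distinction, assuming the abort path records a
    negative errno and keeps the first one it sees (an assumption of this
    sketch). struct journal, journal_abort() and needs_fsck() are
    hypothetical; only ESHUTDOWN comes from <errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical journal state: the errno recorded when it was aborted. */
struct journal {
	int abort_errno;	/* 0 if not aborted */
};

void journal_abort(struct journal *j, int err)
{
	if (j->abort_errno == 0)	/* assumption: keep the first error */
		j->abort_errno = err;
}

/* A deliberate shutdown (-ESHUTDOWN) does not imply on-disk corruption. */
bool needs_fsck(const struct journal *j)
{
	return j->abort_errno != 0 && j->abort_errno != -ESHUTDOWN;
}
```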
     

18 Dec, 2017

1 commit

  • A number of ext4 source files were skipped because their copyright
    permission statements didn't match the expected text used by the
    automated conversion utilities. I've added SPDX tags for the rest.

    While looking at some of these files, I've noticed that we have quite
    a bit of variation in the licenses that were used --- in particular
    some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
    and we have some files that have a LGPL-2.1 license (which was quite
    surprising).

    I've not attempted to do any license changes. Even if it is perfectly
    legal to relicense to GPL 2.0-only for consistency's sake, that should
    be done with ext4 developer community discussion.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

18 Nov, 2017

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "Save for a few late fixes, all of these commits have shipped in -next
    releases since before the merge window opened, and 0day has given a
    build success notification.

    The ext4 touches came from Jan, and the xfs touches have Darrick's
    reviewed-by. An xfstest for the MAP_SYNC feature has been through
    a few rounds of review and is on track to be merged.

    - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
    'userspace flush' of persistent memory updates via filesystem-dax
    mappings. It arranges for any filesystem metadata updates that may
    be required to satisfy a write fault to also be flushed ("on disk")
    before the kernel returns to userspace from the fault handler.
    Effectively every write-fault that dirties metadata completes an
    fsync() before returning from the fault handler. The new
    MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
    is validated as supported by the filesystem's ->mmap() file
    operation.

    - Add support for the standard ACPI 6.2 label access methods that
    replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
    This enables interoperability with environments that only implement
    the standardized methods.

    - Add support for the ACPI 6.2 NVDIMM media error injection methods.

    - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
    latch last shutdown status, firmware update, SMART error injection,
    and SMART alarm threshold control.

    - Clean up physical address information disclosures so that they are root-only.

    - Fix revalidation of the DIMM "locked label area" status to support
    dynamic unlock of the label area.

    - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
    (system-physical-address) command and error injection commands.

    Acknowledgements that came after the commits were pushed to -next:

    - 957ac8c421ad ("dax: fix PMD faults on zero-length files"):
    Reviewed-by: Ross Zwisler

    - a39e596baa07 ("xfs: support for synchronous DAX faults") and
    7b565c9f965b ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
    Reviewed-by: Darrick J. Wong "

    * tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
    acpi, nfit: add 'Enable Latch System Shutdown Status' command support
    dax: fix general protection fault in dax_alloc_inode
    dax: fix PMD faults on zero-length files
    dax: stop requiring a live device for dax_flush()
    brd: remove dax support
    dax: quiet bdev_dax_supported()
    fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
    tools/testing/nvdimm: unit test clear-error commands
    acpi, nfit: validate commands against the device type
    tools/testing/nvdimm: stricter bounds checking for error injection commands
    xfs: support for synchronous DAX faults
    xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
    ext4: Support for synchronous DAX faults
    ext4: Simplify error handling in ext4_dax_huge_fault()
    dax: Implement dax_finish_sync_fault()
    dax, iomap: Add support for synchronous faults
    mm: Define MAP_SYNC and VM_SYNC flags
    dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
    dax: Allow dax_iomap_fault() to return pfn
    dax: Fix comment describing dax_iomap_fault()
    ...

    Linus Torvalds
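    From userspace, the MAP_SYNC interface described above might be used like
    this. The sketch tries MAP_SHARED_VALIDATE | MAP_SYNC first and falls back
    to a plain MAP_SHARED mapping plus msync() when the kernel or filesystem
    rejects it (for example, with EOPNOTSUPP on a filesystem without DAX
    support). map_for_update() and demo_map_sync() are hypothetical helpers;
    the fallback flag values are an assumption for systems whose headers lack
    them, and flushing the data itself out of CPU caches on real persistent
    memory is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03	/* Linux UAPI value */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000		/* Linux x86 value (assumed elsewhere) */
#endif

/* Prefer a MAP_SYNC mapping; fall back to plain MAP_SHARED if rejected. */
void *map_for_update(int fd, size_t len, bool *is_sync)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	*is_sync = (p != MAP_FAILED);
	if (!*is_sync)	/* e.g. EOPNOTSUPP on a non-DAX filesystem */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return p;
}

/* Demo: store one byte through a mapping of a scratch file. */
int demo_map_sync(void)
{
	char path[] = "/tmp/mapsync-XXXXXX";
	int fd = mkstemp(path);
	bool is_sync;
	int ret = -1;

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, 4096) == 0) {
		char *p = map_for_update(fd, 4096, &is_sync);

		if (p != MAP_FAILED) {
			p[0] = 'x';
			/*
			 * With MAP_SYNC, metadata needed by the write fault was
			 * made durable before the fault returned; without it,
			 * msync() is needed to persist the update.
			 */
			ret = is_sync ? 0 : msync(p, 4096, MS_SYNC);
			munmap(p, 4096);
		}
	}
	close(fd);
	return ret;
}
```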
     

03 Nov, 2017

1 commit

  • We return the IOMAP_F_DIRTY flag from ext4_iomap_begin() when asked to
    prepare blocks for writing and the inode has some uncommitted metadata
    changes. In the fault handler ext4_dax_fault() we then detect this case
    (through VM_FAULT_NEEDDSYNC return value) and call helper
    dax_finish_sync_fault() to flush metadata changes and insert page table
    entry. Note that this will also dirty corresponding radix tree entry
    which is what we want - fsync(2) will still provide data integrity
    guarantees for applications not using userspace flushing. And
    applications using userspace flushing can avoid calling fsync(2) and
    thus avoid the performance overhead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

19 Oct, 2017

1 commit

  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.

    Signed-off-by: Kees Cook
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: linux-ext4@vger.kernel.org
    Cc: Thomas Gleixner

    Kees Cook
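    The timer_setup()/from_timer() style can be re-created in userspace to
    show how a callback that receives only the struct timer_list pointer
    recovers its containing object via container_of(). mini_timer_setup(),
    fire(), and the reduced struct timer_list and struct journal are
    hypothetical; the real timer_setup() also takes a flags argument.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of the kernel's container_of() and from_timer(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define from_timer(var, timer_ptr, field) \
	container_of(timer_ptr, __typeof__(*(var)), field)

/* Hypothetical mini timer: the callback gets the timer pointer itself. */
struct timer_list {
	void (*function)(struct timer_list *t);
};

void mini_timer_setup(struct timer_list *t, void (*fn)(struct timer_list *))
{
	t->function = fn;
}

/* Simulated expiry: invoke the callback with the timer pointer. */
void fire(struct timer_list *t)
{
	t->function(t);
}

/* Hypothetical journal embedding its commit timer. */
struct journal {
	int commit_requested;
	struct timer_list j_commit_timer;
};

void commit_timeout(struct timer_list *t)
{
	struct journal *journal = from_timer(journal, t, j_commit_timer);

	journal->commit_requested = 1;	/* the real callback wakes kjournald2 */
}
```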
     

20 Jun, 2017

1 commit


11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

09 May, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:

    - add GETFSMAP support

    - some performance improvements for very large file systems and for
    random write workloads into a preallocated file

    - bug fixes and cleanups.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: cleanup write flags handling from jbd2_write_superblock()
    ext4: mark superblock writes synchronous for nobarrier mounts
    ext4: inherit encryption xattr before other xattrs
    ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
    ext4: avoid unnecessary transaction stalls during writeback
    ext4: preload block group descriptors
    ext4: make ext4_shutdown() static
    ext4: support GETFSMAP ioctls
    vfs: add common GETFSMAP ioctl definitions
    ext4: evict inline data when writing to memory map
    ext4: remove ext4_xattr_check_entry()
    ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
    ext4: merge ext4_xattr_list() into ext4_listxattr()
    ext4: constify static data that is never modified
    ext4: trim return value and 'dir' argument from ext4_insert_dentry()
    jbd2: fix dbench4 performance regression for 'nobarrier' mounts
    jbd2: Fix lockdep splat with generic/270 test
    mm: retry writepages() on ENOMEM when doing an data integrity writeback

    Linus Torvalds
     

04 May, 2017

2 commits

  • Currently jbd2_write_superblock() silently adds REQ_SYNC to the flags
    with which the journal superblock is written. Make this explicit by
    making the flags passed down to jbd2_write_superblock() contain REQ_SYNC.

    CC: linux-ext4@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • kjournald2 is central to transaction commit processing. As such, any
    potential allocation from this kernel thread has to be GFP_NOFS. Make
    sure to mark the whole kernel thread GFP_NOFS by using memalloc_nofs_save().

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
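    The scope API used for kjournald2 follows a save/restore pattern on
    per-task flags. This userspace model is a sketch: the flag values and the
    names nofs_save(), nofs_restore() and effective_gfp() are hypothetical,
    modeled on memalloc_nofs_save()/memalloc_nofs_restore() and on the
    allocator dropping __GFP_FS inside a NOFS scope. Wrapping the whole thread
    body in one save/restore pair makes every allocation inside it implicitly
    NOFS, and returning the old state makes nested scopes safe.

```c
#include <assert.h>

/* Hypothetical flag values modeled on PF_MEMALLOC_NOFS and __GFP_FS. */
#define TASK_NOFS	0x1u
#define GFP_FS_BIT	0x2u
#define GFP_NORMAL	(GFP_FS_BIT | 0x4u)	/* a GFP_KERNEL-like request */

unsigned task_flags;	/* stands in for current->flags */

/* Enter a NOFS scope; returns the previous state so scopes can nest. */
unsigned nofs_save(void)
{
	unsigned old = task_flags & TASK_NOFS;

	task_flags |= TASK_NOFS;
	return old;
}

/* Leave the scope, restoring exactly the state nofs_save() saw. */
void nofs_restore(unsigned old)
{
	task_flags = (task_flags & ~TASK_NOFS) | old;
}

/* The allocator strips the FS bit from requests made inside a NOFS scope. */
unsigned effective_gfp(unsigned gfp)
{
	return (task_flags & TASK_NOFS) ? (gfp & ~GFP_FS_BIT) : gfp;
}
```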
     

30 Apr, 2017

2 commits

  • Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
    synchronous" removed the REQ_SYNC flag from the WRITE_FUA implementation.
    Since JBD2 strips the REQ_FUA and REQ_FLUSH flags from submitted IO when
    the filesystem is mounted with the nobarrier mount option, journal
    superblock writes ended up being async writes after this patch, and that
    caused a heavy performance regression for the dbench4 benchmark with a
    high number of processes. In my test setup with an HP RAID array with a
    non-volatile write cache and 32 GB of RAM, dbench4 runs with 8 processes
    regressed by ~25%.

    Fix the problem by making sure journal superblock writes are always
    treated as synchronous since they generally block progress of the
    journalling machinery and thus the whole filesystem.

    Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • I've hit a lockdep splat with generic/270 test complaining that:

    3216.fsstress.b/3533 is trying to acquire lock:
    (jbd2_handle){++++..}, at: [] jbd2_log_wait_commit+0x0/0x150

    but task is already holding lock:
    (jbd2_handle){++++..}, at: [] start_this_handle+0x35b/0x850

    The underlying problem is that jbd2_journal_force_commit_nested()
    (called from ext4_should_retry_alloc()) may get called while a
    transaction handle is started. In such case it takes care to not wait
    for commit of the running transaction (which would deadlock) but only
    for a commit of a transaction that is already committing (which is safe
    as that doesn't wait for any filesystem locks).

    In fact, there are also other callers of jbd2_log_wait_commit() that take
    care to pass the tid of a transaction that is already committing, and for
    those cases the lockdep instrumentation is too restrictive and leads to
    false-positive reports. Fix the problem by calling
    jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
    transaction isn't already committing.

    Fixes: 1eaa566d368b214d99cbb973647c1b0b8102a9ae
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
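    For the dbench4 regression fix above, the flag handling for journal
    superblock writes can be sketched as follows, with hypothetical bit values
    (the real REQ_* constants differ). REQ_SYNC is added independently of
    REQ_FUA, so stripping the cache-flush flags under nobarrier no longer
    turns the write into an async one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real REQ_* constants differ. */
#define F_SYNC		0x1u
#define F_FUA		0x2u
#define F_PREFLUSH	0x4u

/* Flags actually used for a journal superblock write. */
unsigned sb_write_flags(unsigned requested, bool barrier)
{
	unsigned flags = requested;

	if (!barrier)			/* nobarrier: no cache-flush semantics */
		flags &= ~(F_FUA | F_PREFLUSH);
	return flags | F_SYNC;		/* always synchronous, independent of FUA */
}
```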
     

23 Apr, 2017

1 commit


19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

16 Mar, 2017

1 commit

  • In journal_init_common(), if we failed to allocate the j_wbuf array, or
    if we failed to create the buffer_head for the journal superblock, we
    leaked the memory allocated for the revocation tables. Fix this.

    Cc: stable@vger.kernel.org # 4.9
    Fixes: f0c9fd5458bacf7b12a9a579a727dc740cbe047e
    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Eric Biggers
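    The usual shape of a fix for this class of leak is goto-based unwinding in
    reverse allocation order. This sketch uses a hypothetical three-allocation
    journal, with a counter to detect leaks and a knob to make any one
    allocation fail; xalloc(), xfree() and journal_init() are not the jbd2
    functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced journal with three allocations made during init. */
struct journal {
	void *revoke_table;
	void *wbuf;
	void *sb_buffer;
};

int live_allocs;	/* net tracked allocations, to detect leaks */
int fail_at;		/* test knob: which allocation fails (0 = none) */

void *xalloc(int which)
{
	void *p;

	if (which == fail_at)
		return NULL;
	p = malloc(16);
	if (p)
		live_allocs++;
	return p;
}

void xfree(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

struct journal *journal_init(void)
{
	struct journal *j = calloc(1, sizeof(*j));

	if (!j)
		return NULL;
	j->revoke_table = xalloc(1);
	if (!j->revoke_table)
		goto err_free_journal;
	j->wbuf = xalloc(2);
	if (!j->wbuf)
		goto err_free_revoke;
	j->sb_buffer = xalloc(3);
	if (!j->sb_buffer)
		goto err_free_wbuf;
	return j;

	/* Unwind in reverse order of setup. */
err_free_wbuf:
	xfree(j->wbuf);
err_free_revoke:
	xfree(j->revoke_table);	/* the cleanup the leak report says was missing */
err_free_journal:
	free(j);
	return NULL;
}

void journal_destroy(struct journal *j)
{
	xfree(j->sb_buffer);
	xfree(j->wbuf);
	xfree(j->revoke_table);
	free(j);
}
```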
     

21 Feb, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "For this cycle we add support for the shutdown ioctl, which is
    primarily used for testing, but which can be useful on production
    systems when a scratch volume is being destroyed and the data on it
    doesn't need to be saved.

    This found (and we fixed) a number of bugs with ext4's recovery from a
    corrupted file system --- the bugs increased the amount of data that
    could potentially be lost, and in the case of the inline data feature,
    could cause the kernel to BUG.

    Also included are a number of other bug fixes, including in ext4's
    fscrypt, DAX, and inline data support"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
    ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
    ext4: fix fencepost in s_first_meta_bg validation
    ext4: don't BUG when truncating encrypted inodes on the orphan list
    ext4: do not use stripe_width if it is not set
    ext4: fix stripe-unaligned allocations
    dax: assert that i_rwsem is held exclusive for writes
    ext4: fix DAX write locking
    ext4: add EXT4_IOC_GOINGDOWN ioctl
    ext4: add shutdown bit and check for it
    ext4: rename s_resize_flags to s_ext4_flags
    ext4: return EROFS if device is r/o and journal replay is needed
    ext4: preserve the needs_recovery flag when the journal is aborted
    jbd2: don't leak modified metadata buffers on an aborted journal
    ext4: fix inline data error paths
    ext4: move halfmd4 into hash.c directly
    ext4: fix use-after-iput when fscrypt contexts are inconsistent
    jbd2: fix use after free in kjournald2()
    ext4: fix data corruption in data=journal mode
    ext4: trim allocation requests to group size
    ext4: replace BUG_ON with WARN_ON in mb_find_extent()
    ...

    Linus Torvalds
     

02 Feb, 2017

1 commit

  • Below is the synchronization issue between the unmount and kjournald2
    contexts, which results in a use-after-free in kjournald2().
    Fix this issue by using journal->j_state_lock to synchronize the
    wait_event() done in journal_kill_thread() and the wake_up() done
    in kjournald2().

    TASK 1:
    umount cmd:
      |--jbd2_journal_destroy() {
          |--journal_kill_thread() {
              write_lock(&journal->j_state_lock);
              journal->j_flags |= JBD2_UNMOUNT;
              ...
              write_unlock(&journal->j_state_lock);
              wake_up(&journal->j_wait_commit);   TASK 2 wakes up here:
                                                  kjournald2() {
                                                    ...
                                                    checks JBD2_UNMOUNT flag and calls goto end_loop;
                                                    ...
                                                    end_loop:
                                                      write_unlock(&journal->j_state_lock);
                                                      journal->j_task = NULL; --> If this thread gets
                                                      pre-empted here, then TASK 1's wait_event will
                                                      exit even before this thread is completely
                                                      done.
              wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
              ...
              write_lock(&journal->j_state_lock);
              write_unlock(&journal->j_state_lock);
          }
          |--kfree(journal);
      }
    }
                                                      wake_up(&journal->j_wait_done_commit); --> this step
                                                      now results in a use-after-free.
                                                  }

    Signed-off-by: Sahitya Tummala
    Signed-off-by: Theodore Ts'o

    Sahitya Tummala
     

14 Jan, 2017

1 commit

  • When an ext4 fs is bogged down by a lot of metadata IOs (in the
    reported case, it was deletion of millions of files, but any massive
    amount of journal writes would do), after the journal is filled up,
    tasks which try to access the filesystem and aren't currently
    performing the journal writes end up waiting in
    __jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

    Because those mutex sleeps aren't marked as iowait, this condition can
    lead to misleadingly low iowait and /proc/stat:procs_blocked. While
    iowait propagation is far from strict, this condition can be triggered
    fairly easily and annotating these sleeps correctly helps initial
    diagnosis quite a bit.

    Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
    these sleeps are properly marked as iowait.

    Reported-by: Mingbo Wan
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andreas Dilger
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Theodore Ts'o
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

25 Dec, 2016

1 commit


01 Nov, 2016

1 commit


16 Sep, 2016

1 commit

  • There is some repetitive code in jbd2_journal_init_dev() and
    jbd2_journal_init_inode(). This patch moves the common code into a
    journal_init_common() helper to simplify the code. It also fixes the
    coding style warnings reported by checkpatch.pl along the way.

    Signed-off-by: Geliang Tang
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Geliang Tang
     

27 Jul, 2016

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "The major change this cycle is deleting ext4's copy of the file system
    encryption code and switching things over to using the copies in
    fs/crypto. I've updated the MAINTAINERS file to add an entry for
    fs/crypto listing Jaeguk Kim and myself as the maintainers.

    There are also a number of bug fixes, most notably for some problems
    found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also
    fixed is a writeback deadlock detected by generic/130, and some
    potential races in the metadata checksum code"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
    ext4: verify extent header depth
    ext4: short-cut orphan cleanup on error
    ext4: fix reference counting bug on block allocation error
    MAINTAINRES: fs-crypto maintainers update
    ext4 crypto: migrate into vfs's crypto engine
    ext2: fix filesystem deadlock while reading corrupted xattr block
    ext4: fix project quota accounting without quota limits enabled
    ext4: validate s_reserved_gdt_blocks on mount
    ext4: remove unused page_idx
    ext4: don't call ext4_should_journal_data() on the journal inode
    ext4: Fix WARN_ON_ONCE in ext4_commit_super()
    ext4: fix deadlock during page writeback
    ext4: correct error value of function verifying dx checksum
    ext4: avoid modifying checksum fields directly during checksum verification
    ext4: check for extents that wrap around
    jbd2: make journal y2038 safe
    jbd2: track more dependencies on transaction commit
    jbd2: move lockdep tracking to journal_s
    jbd2: move lockdep instrumentation for jbd2 handles
    ext4: respect the nobarrier mount option in nojournal mode
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

30 Jun, 2016

2 commits

  • So far we were tracking only the dependency on transaction commit caused
    by starting a new handle (which may require a commit to start a new
    transaction). Now also add tracking for the other cases where we wait for
    transaction commit. This way lockdep can catch deadlocks, e.g. when we
    call jbd2_journal_stop() for a synchronous handle with some locks held
    which rank below transaction start.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Currently the lockdep map is tracked in each journal handle. To be able to
    expand lockdep support to also cover other cases where we depend on
    transaction commit and where a handle is not available, move the lockdep
    map into struct journal_s. Since this makes the lockdep map shared by all
    handles, we have to use rwsem_acquire_read() for acquisitions now.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

25 Jun, 2016

1 commit

  • jbd2_alloc is explicit about its allocation preferences with respect to
    the allocation size. Sub-page allocations go to the slab allocator, and
    larger ones use either the page allocator or vmalloc. This is all good,
    but the logic is unnecessarily complex.

    1) as per Ted, the vmalloc fallback is a left-over:

    : jbd2_alloc is only passed in the bh->b_size, which can't be > PAGE_SIZE, so
    : the code path that calls vmalloc() should never get called. When we
    : converted jbd2_alloc() to support sub-page size allocations in commit
    : d2eecb039368, there was an assumption that it could be called with a size
    : greater than PAGE_SIZE, but that's certainly not true today.

    Moreover, a vmalloc allocation might even lead to a deadlock, because the
    callers expect a GFP_NOFS context while vmalloc is GFP_KERNEL.

    2) __GFP_REPEAT for requests
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 Jun, 2016

3 commits


24 Apr, 2016

1 commit

  • Currently, when a filesystem needs to make sure data is on permanent
    storage before committing a transaction, it adds the inode to the
    transaction's inode list. During transaction commit, jbd2 writes back
    all dirty buffers that have allocated underlying blocks and waits for
    the IO to finish. However, when doing writeback for delayed-allocation
    data, we allocate blocks and submit the data immediately. Thus asking
    jbd2 to write dirty pages just adds unnecessary work to jbd2, possibly
    writing back other redirtied blocks.

    Add support to jbd2 to allow a filesystem to ask jbd2 to only wait for
    outstanding data writes before committing a transaction, and thus avoid
    unnecessary writes.

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
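    A hypothetical model of the two per-inode behaviors: the WRITE_DATA and
    WAIT_DATA flags, struct tracked_inode and commit_data_phase() are
    illustrative names, not necessarily those used by jbd2. An inode flagged
    only for waiting costs the commit no extra writeback, and pages redirtied
    after the delalloc writeback are left to normal writeback.

```c
#include <assert.h>

/* Hypothetical per-inode flags for the two commit-time behaviors. */
#define WRITE_DATA	0x1u	/* commit must start writeback of dirty data */
#define WAIT_DATA	0x2u	/* commit must wait for in-flight data I/O */

struct tracked_inode {
	unsigned flags;
	int dirty_pages;	/* pages still dirty in the page cache */
	int inflight_io;	/* writes already submitted, not yet complete */
};

int pages_written;	/* extra writeback work done by the commit */

/* Data phase of a commit, before the commit block is written. */
void commit_data_phase(struct tracked_inode *inodes, int n)
{
	for (int i = 0; i < n; i++) {
		struct tracked_inode *ino = &inodes[i];

		if (ino->flags & WRITE_DATA) {
			pages_written += ino->dirty_pages;
			ino->inflight_io += ino->dirty_pages;
			ino->dirty_pages = 0;
		}
		if (ino->flags & WAIT_DATA)
			ino->inflight_io = 0;	/* stands in for waiting on completion */
	}
}
```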
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it likely never will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files,
    so I've called spatch for them manually.

    The only adjustment after coccinelle is the revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
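    Why the semantic patch preserves behavior can be checked directly:
    PAGE_CACHE_SHIFT was an alias for PAGE_SHIFT, so the shift distance is
    zero and E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) is just E. The sketch assumes
    4 KiB pages (PAGE_SHIFT of 12); old_style() and new_style() are
    hypothetical helpers.

```c
#include <assert.h>

/* Assumes 4 KiB pages; the old header defined PAGE_CACHE_* as aliases. */
#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define PAGE_CACHE_SIZE		PAGE_SIZE

/* Old-style conversion: the shift distance is always zero. */
unsigned long old_style(unsigned long index)
{
	return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* What the "- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)" / "+ E" rule produces. */
unsigned long new_style(unsigned long index)
{
	return index;
}
```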
     

10 Mar, 2016

1 commit

  • On the umount path, jbd2_journal_destroy() writes the latest transaction
    ID (->j_tail_sequence) to be used at the next mount.

    The bug is that ->j_tail_sequence does not hold the latest transaction ID
    in some cases. So, at the next mount, there is a chance of conflicting
    with remaining (not yet overwritten) transactions.

    mount (id=10)
    write transaction (id=11)
    write transaction (id=12)
    umount (id=10) j_tail_sequence is not updated.
    (And another case is, __jbd2_journal_clean_checkpoint_list() is called
    with empty transaction.)

    So in the above cases, ->j_tail_sequence does not point to the latest
    transaction ID on the umount path. Plus, the REQ_FLUSH for the checkpoint
    is not issued either.

    So, to fix this problem with minimal changes, this patch updates
    ->j_tail_sequence and issues a REQ_FLUSH. (With more complex changes,
    some optimizations would be possible, for example avoiding an
    unnecessary REQ_FLUSH.)

    BTW,

    journal->j_tail_sequence =
    ++journal->j_transaction_sequence;

    The increment of ->j_transaction_sequence seems to be unnecessary, but
    ext3 does this.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    OGAWA Hirofumi
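    A reduced model of the unmount-time update, assuming j_transaction_sequence
    holds the next tid to hand out. struct journal and
    journal_destroy_update_tail() are hypothetical; the pre-increment mirrors
    the assignment quoted in the commit message. Starting from the scenario
    described there (mount at 10, transactions 11 and 12 written), the saved
    tail ends up past every tid still present in the log.

```c
#include <assert.h>

/* Hypothetical, reduced journal state. */
struct journal {
	unsigned tail_sequence;		/* oldest tid still needed; saved in the sb */
	unsigned transaction_sequence;	/* next tid to hand out */
	int flushes;			/* cache flushes issued */
};

/*
 * Unmount after everything is checkpointed: flush first so checkpointed
 * blocks are durable, then advance the tail past every tid handed out.
 */
void journal_destroy_update_tail(struct journal *j)
{
	j->flushes++;			/* stands in for the REQ_FLUSH */
	j->tail_sequence = ++j->transaction_sequence;
}
```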