14 Nov, 2018
1 commit
-
commit ccd3c4373eacb044eb3832966299d13d2631f66f upstream.
The code cleaning transaction's lists of checkpoint buffers has a bug
where it increases bh refcount only after releasing
journal->j_list_lock. Thus the following race is possible:CPU0 CPU1
jbd2_log_do_checkpoint()
jbd2_journal_try_to_free_buffers()
__journal_try_to_free_buffer(bh)
...
while (transaction->t_checkpoint_io_list)
...
if (buffer_locked(bh)) {spin_unlock(&journal->j_list_lock);
spin_lock(&journal->j_list_lock);
__jbd2_journal_remove_checkpoint(jh);
spin_unlock(&journal->j_list_lock);
try_to_free_buffers(page);
get_bh(bh) j_list_lock.Fixes: dc6e8d669cf5 ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()")
Fixes: be1158cc615f ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()")
Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Reviewed-by: Lukas Czerner
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman
11 Jul, 2018
1 commit
-
commit e09463f220ca9a1a1ecfda84fcda658f99a1f12a upstream.
Do not set the b_modified flag in block's journal head should not
until after we're sure that jbd2_journal_dirty_metadat() will not
abort with an error due to there not being enough space reserved in
the jbd2 handle.Otherwise, future attempts to modify the buffer may lead a large
number of spurious errors and warnings.This addresses CVE-2018-10883.
https://bugzilla.kernel.org/show_bug.cgi?id=200071
Signed-off-by: Theodore Ts'o
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman
02 May, 2018
1 commit
-
commit b2569260d55228b617bd82aba6d0db2faeeb4116 upstream.
If ext4 tries to start a reserved handle via
jbd2_journal_start_reserved(), and the journal has been aborted, this
can result in a NULL pointer dereference. This is because the fields
h_journal and h_transaction in the handle structure share the same
memory, via a union, so jbd2_journal_start_reserved() will clear
h_journal before calling start_this_handle(). If this function fails
due to an aborted handle, h_journal will still be NULL, and the call
to jbd2_journal_free_reserved() will pass a NULL journal to
sub_reserve_credits().This can be reproduced by running "kvm-xfstests -c dioread_nolock
generic/475".Cc: stable@kernel.org # 3.11
Fixes: 8f7d89f36829b ("jbd2: transaction reservation support")
Signed-off-by: Theodore Ts'o
Reviewed-by: Andreas Dilger
Reviewed-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman
24 Apr, 2018
2 commits
-
commit fb7c02445c497943e7296cd3deee04422b63acb8 upstream.
Previously the jbd2 layer assumed that a file system check would be
required after a journal abort. In the case of the deliberate file
system shutdown, this should not be necessary. Allow the jbd2 layer
to distinguish between these two cases by using the ESHUTDOWN errno.Also add proper locking to __journal_abort_soft().
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman -
commit 85e0c4e89c1b864e763c4e3bb15d0b6d501ad5d9 upstream.
This updates the jbd2 superblock unnecessarily, and on an abort we
shouldn't truncate the log.Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
22 Feb, 2018
1 commit
-
commit f69120ce6c024aa634a8fc25787205e42f0ccbe6 upstream.
Sphinx emits various (26) warnings when building make target 'htmldocs'.
Currently struct definitions contain duplicate documentation, some as
kernel-docs and some as standard c89 comments. We can reduce
duplication while cleaning up the kernel docs.Move all kernel-docs to right above each struct member. Use the set of
all existing comments (kernel-doc and c89). Add documentation for
missing struct members and function arguments.Signed-off-by: Tobin C. Harding
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
08 Jul, 2017
1 commit
-
Pull Writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may even not be possible.But wait...it gets worse!
The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.This pile aims to do three things:
1) ensure that when a writeback error occurs that that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can in via individual
filesystem trees.2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:https://lwn.net/Articles/718734/
https://lwn.net/Articles/724307/"* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty
06 Jul, 2017
1 commit
-
Resetting this flag is almost certainly racy, and will be problematic
with some coming changes.Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
Have jbd2 call it instead of filemap_fdatawait and don't attempt to
re-set the error flag if it fails.Reviewed-by: Jan Kara
Reviewed-by: Carlos Maiolino
Signed-off-by: Jeff Layton
04 Jul, 2017
1 commit
-
Pull documentation updates from Jonathan Corbet:
"There has been a fair amount of activity in the docs tree this time
around. Highlights include:- Conversion of a bunch of security documentation into RST
- The conversion of the remaining DocBook templates by The Amazing
Mauro Machine. We can now drop the entire DocBook build chain.- The usual collection of fixes and minor updates"
* tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
scripts/kernel-doc: handle DECLARE_HASHTABLE
Documentation: atomic_ops.txt is core-api/atomic_ops.rst
Docs: clean up some DocBook loose ends
Make the main documentation title less Geocities
Docs: Use kernel-figure in vidioc-g-selection.rst
Docs: fix table problems in ras.rst
Docs: Fix breakage with Sphinx 1.5 and upper
Docs: Include the Latex "ifthen" package
doc/kokr/howto: Only send regression fixes after -rc1
docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
doc: Document suitability of IBM Verse for kernel development
Doc: fix a markup error in coding-style.rst
docs: driver-api: i2c: remove some outdated information
Documentation: DMA API: fix a typo in a function name
Docs: Insert missing space to separate link from text
doc/ko_KR/memory-barriers: Update control-dependencies example
Documentation, kbuild: fix typo "minimun" -> "minimum"
docs: Fix some formatting issues in request-key.rst
doc: ReSTify keys-trusted-encrypted.txt
doc: ReSTify keys-request-key.txt
...
20 Jun, 2017
1 commit
-
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
22 May, 2017
1 commit
-
When a transaction starts, start_this_handle() saves current
PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
Journal restart is a special case that calls start_this_handle() without
stopping the transaction. start_this_handle() isn't aware that the
original value is already stored so it overwrites it with current value.For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
at the end:jbd2_journal_start()
jbd2__journal_restart()
jbd2_journal_stop()Make jbd2__journal_restart() restore the original value before calling
start_this_handle().Fixes: 81378da64de6 ("jbd2: mark the transaction context with the scope GFP_NOFS context")
Signed-off-by: Tahsin Erdogan
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara
19 May, 2017
1 commit
-
Mauro says:
This patch series convert the remaining DocBooks to ReST.
The first version was originally
send as 3 patch series:[PATCH 00/36] Convert DocBook documents to ReST
[PATCH 0/5] Convert more books to ReST
[PATCH 00/13] Get rid of DocBookThe lsm book was added as if it were a text file under
Documentation. The plan is to merge it with another file
under Documentation/security, after both this series and
a security Documentation patch series gets merged.It also adjusts some Sphinx-pedantic errors/warnings on
some kernel-doc markups.I also added some patches here to add PDF output for all
existing ReST books.
16 May, 2017
2 commits
-
kernel-doc will try to interpret a foo() string, except if
properly escaped.Reviewed-by: Jan Kara
Signed-off-by: Mauro Carvalho Chehab -
kernel-doc script expects that a function documentation to
be just before the function, otherwise it will be ignored.So, move the kernel-doc markup to the right place.
Reviewed-by: Jan Kara
Signed-off-by: Mauro Carvalho Chehab
11 May, 2017
1 commit
-
Pull RCU updates from Ingo Molnar:
"The main changes are:- Debloat RCU headers
- Parallelize SRCU callback handling (plus overlapping patches)
- Improve the performance of Tree SRCU on a CPU-hotplug stress test
- Documentation updates
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Open-code the rcu_cblist_n_lazy_cbs() function
rcu: Open-code the rcu_cblist_n_cbs() function
rcu: Open-code the rcu_cblist_empty() function
rcu: Separately compile large rcu_segcblist functions
srcu: Debloat the header
srcu: Adjust default auto-expediting holdoff
srcu: Specify auto-expedite holdoff time
srcu: Expedite first synchronize_srcu() when idle
srcu: Expedited grace periods with reduced memory contention
srcu: Make rcutorture writer stalls print SRCU GP state
srcu: Exact tracking of srcu_data structures containing callbacks
srcu: Make SRCU be built by default
srcu: Fix Kconfig botch when SRCU not selected
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
srcu: Expedite srcu_schedule_cbs_snp() callback invocation
srcu: Parallelize callback handling
kvm: Move srcu_struct fields to end of struct kvm
rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
rcu: Use true/false in assignment to bool
rcu: Use bool value directly
...
09 May, 2017
1 commit
-
Pull ext4 updates from Ted Ts'o:
- add GETFSMAP support
- some performance improvements for very large file systems and for
random write workloads into a preallocated file- bug fixes and cleanups.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: cleanup write flags handling from jbd2_write_superblock()
ext4: mark superblock writes synchronous for nobarrier mounts
ext4: inherit encryption xattr before other xattrs
ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
ext4: avoid unnecessary transaction stalls during writeback
ext4: preload block group descriptors
ext4: make ext4_shutdown() static
ext4: support GETFSMAP ioctls
vfs: add common GETFSMAP ioctl definitions
ext4: evict inline data when writing to memory map
ext4: remove ext4_xattr_check_entry()
ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
ext4: merge ext4_xattr_list() into ext4_listxattr()
ext4: constify static data that is never modified
ext4: trim return value and 'dir' argument from ext4_insert_dentry()
jbd2: fix dbench4 performance regression for 'nobarrier' mounts
jbd2: Fix lockdep splat with generic/270 test
mm: retry writepages() on ENOMEM when doing an data integrity writeback
04 May, 2017
3 commits
-
Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with
which journal superblock is written. Make this explicit by making flags
passed down to jbd2_write_superblock() contain REQ_SYNC.CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o -
kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org
Signed-off-by: Michal Hocko
Suggested-by: Jan Kara
Reviewed-by: Jan Kara
Cc: Dave Chinner
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: David Sterba
Cc: Brian Foster
Cc: Darrick J. Wong
Cc: Nikolay Borisov
Cc: Peter Zijlstra
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reviewed-by: Jan Kara
Cc: Dave Chinner
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: David Sterba
Cc: Brian Foster
Cc: Darrick J. Wong
Cc: Nikolay Borisov
Cc: Peter Zijlstra
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Apr, 2017
2 commits
-
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
filesystem is mounted with nobarrier mount option, journal superblock
writes ended up being async writes after this patch and that caused
heavy performance regression for dbench4 benchmark with high number of
processes. In my test setup with HP RAID array with non-volatile write
cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.Fix the problem by making sure journal superblock writes are always
treated as synchronous since they generally block progress of the
journalling machinery and thus the whole filesystem.Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o -
I've hit a lockdep splat with generic/270 test complaining that:
3216.fsstress.b/3533 is trying to acquire lock:
(jbd2_handle){++++..}, at: [] jbd2_log_wait_commit+0x0/0x150but task is already holding lock:
(jbd2_handle){++++..}, at: [] start_this_handle+0x35b/0x850The underlying problem is that jbd2_journal_force_commit_nested()
(called from ext4_should_retry_alloc()) may get called while a
transaction handle is started. In such case it takes care to not wait
for commit of the running transaction (which would deadlock) but only
for a commit of a transaction that is already committing (which is safe
as that doesn't wait for any filesystem locks).In fact there are also other callers of jbd2_log_wait_commit() that take
care to pass tid of a transaction that is already committing and for
those cases, the lockdep instrumentation is too restrictive and leading
to false positive reports. Fix the problem by calling
jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
transaction isn't already committing.Fixes: 1eaa566d368b214d99cbb973647c1b0b8102a9ae
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o
23 Apr, 2017
1 commit
-
…k/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- Parallelize SRCU callback handling (plus overlapping patches).
Signed-off-by: Ingo Molnar <mingo@kernel.org>
19 Apr, 2017
1 commit
-
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.Signed-off-by: Paul E. McKenney
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrew Morton
Cc:
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes
16 Mar, 2017
1 commit
-
In journal_init_common(), if we failed to allocate the j_wbuf array, or
if we failed to create the buffer_head for the journal superblock, we
leaked the memory allocated for the revocation tables. Fix this.Cc: stable@vger.kernel.org # 4.9
Fixes: f0c9fd5458bacf7b12a9a579a727dc740cbe047e
Signed-off-by: Eric Biggers
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara
21 Feb, 2017
1 commit
-
Pull ext4 updates from Ted Ts'o:
"For this cycle we add support for the shutdown ioctl, which is
primarily used for testing, but which can be useful on production
systems when a scratch volume is being destroyed and the data on it
doesn't need to be saved.This found (and we fixed) a number of bugs with ext4's recovery to
corrupted file system --- the bugs increased the amount of data that
could be potentially lost, and in the case of the inline data feature,
could cause the kernel to BUG.Also included are a number of other bug fixes, including in ext4's
fscrypt, DAX, inline data support"* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
ext4: fix fencepost in s_first_meta_bg validation
ext4: don't BUG when truncating encrypted inodes on the orphan list
ext4: do not use stripe_width if it is not set
ext4: fix stripe-unaligned allocations
dax: assert that i_rwsem is held exclusive for writes
ext4: fix DAX write locking
ext4: add EXT4_IOC_GOINGDOWN ioctl
ext4: add shutdown bit and check for it
ext4: rename s_resize_flags to s_ext4_flags
ext4: return EROFS if device is r/o and journal replay is needed
ext4: preserve the needs_recovery flag when the journal is aborted
jbd2: don't leak modified metadata buffers on an aborted journal
ext4: fix inline data error paths
ext4: move halfmd4 into hash.c directly
ext4: fix use-after-iput when fscrypt contexts are inconsistent
jbd2: fix use after free in kjournald2()
ext4: fix data corruption in data=journal mode
ext4: trim allocation requests to group size
ext4: replace BUG_ON with WARN_ON in mb_find_extent()
...
05 Feb, 2017
1 commit
-
If the journal has been aborted, we shouldn't mark the underlying
buffer head as dirty, since that will cause the metadata block to get
modified. And if the journal has been aborted, we shouldn't allow
this since it will almost certainly lead to a corrupted file system.Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
02 Feb, 2017
1 commit
-
Below is the synchronization issue between unmount and kjournald2
contexts, which results into use after free issue in kjournald2().
Fix this issue by using journal->j_state_lock to synchronize the
wait_event() done in journal_kill_thread() and the wake_up() done
in kjournald2().TASK 1:
umount cmd:
|--jbd2_journal_destroy() {
|--journal_kill_thread() {
write_lock(&journal->j_state_lock);
journal->j_flags |= JBD2_UNMOUNT;
...
write_unlock(&journal->j_state_lock);
wake_up(&journal->j_wait_commit); TASK 2 wakes up here:
kjournald2() {
...
checks JBD2_UNMOUNT flag and calls goto end-loop;
...
end_loop:
write_unlock(&journal->j_state_lock);
journal->j_task = NULL; --> If this thread gets
pre-empted here, then TASK 1 wait_event will
exit even before this thread is completely
done.
wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
...
write_lock(&journal->j_state_lock);
write_unlock(&journal->j_state_lock);
}
|--kfree(journal);
}
}
wake_up(&journal->j_wait_done_commit); --> this step
now results into use after free issue.
}Signed-off-by: Sahitya Tummala
Signed-off-by: Theodore Ts'o
14 Jan, 2017
1 commit
-
When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked. While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.Reported-by: Mingbo Wan
Signed-off-by: Tejun Heo
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andreas Dilger
Cc: Andrew Morton
Cc: Jan Kara
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Theodore Ts'o
Cc: Thomas Gleixner
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar
25 Dec, 2016
1 commit
-
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*'
sed -i -e "s!$PATT!#include !" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)to do the replacement at the end of the merge window.
Requested-by: Al Viro
Signed-off-by: Linus Torvalds
14 Dec, 2016
1 commit
-
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
01 Nov, 2016
1 commit
-
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
13 Oct, 2016
1 commit
-
When 'jh->b_transaction == transaction' (asserted by below)
J_ASSERT_JH(jh, (jh->b_transaction == transaction || ...
'journal->j_list_lock' will be incorrectly unlocked, since
the the lock is aquired only at the end of if / else-if
statements (missing the else case).Signed-off-by: Taesoo Kim
Signed-off-by: Theodore Ts'o
Reviewed-by: Andreas Dilger
Fixes: 6e4862a5bb9d12be87e4ea5d9a60836ebed71d28
Cc: stable@vger.kernel.org # 3.14+
12 Oct, 2016
1 commit
-
The mapping_set_error() helper sets the correct AS_ flag for the mapping
so there is no reason to open code it. Use the helper directly.[akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Sep, 2016
1 commit
-
Thomas has reported a lockdep splat hitting in
add_transaction_credits(). The problem is that that function calls
jbd2_might_wait_for_commit() while holding j_state_lock which is wrong
(we do not really wait for transaction commit while holding that lock).Fix the problem by moving jbd2_might_wait_for_commit() into places where
we are ready to wait for transaction commit and thus j_state_lock is
unlocked.Cc: stable@vger.kernel.org
Fixes: 1eaa566d368b214d99cbb973647c1b0b8102a9ae
Reported-by: Thomas Gleixner
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o
16 Sep, 2016
1 commit
-
There are some repetitive code in jbd2_journal_init_dev() and
jbd2_journal_init_inode(). So this patch moves the common code into
journal_init_common() helper to simplify the code. And fix the coding
style warnings reported by checkpatch.pl by the way.Signed-off-by: Geliang Tang
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara
27 Jul, 2016
2 commits
-
Pull ext4 updates from Ted Ts'o:
"The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
fs/crypto. I've updated the MAINTAINERS file to add an entry for
fs/crypto listing Jaeguk Kim and myself as the maintainers.There are also a number of bug fixes, most notably for some problems
found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also
fixed is a writeback deadlock detected by generic/130, and some
potential races in the metadata checksum code"* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
ext4: verify extent header depth
ext4: short-cut orphan cleanup on error
ext4: fix reference counting bug on block allocation error
MAINTAINRES: fs-crypto maintainers update
ext4 crypto: migrate into vfs's crypto engine
ext2: fix filesystem deadlock while reading corrupted xattr block
ext4: fix project quota accounting without quota limits enabled
ext4: validate s_reserved_gdt_blocks on mount
ext4: remove unused page_idx
ext4: don't call ext4_should_journal_data() on the journal inode
ext4: Fix WARN_ON_ONCE in ext4_commit_super()
ext4: fix deadlock during page writeback
ext4: correct error value of function verifying dx checksum
ext4: avoid modifying checksum fields directly during checksum verification
ext4: check for extents that wrap around
jbd2: make journal y2038 safe
jbd2: track more dependencies on transaction commit
jbd2: move lockdep tracking to journal_s
jbd2: move lockdep instrumentation for jbd2 handles
ext4: respect the nobarrier mount option in nojournal mode
... -
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
30 Jun, 2016
3 commits
-
The jbd2 journal stores the commit time in 64-bit seconds and 32-bit
nanoseconds, which avoids an overflow in 2038, but it gets the numbers
from current_kernel_time(), which uses 'long' seconds on 32-bit
architectures.This simply changes the code to call current_kernel_time64() so
we use 64-bit seconds consistently.Signed-off-by: Arnd Bergmann
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara
Cc: stable@vger.kernel.org -
So far we were tracking only dependency on transaction commit due to
starting a new handle (which may require commit to start a new
transaction). Now add tracking also for other cases where we wait for
transaction commit. This way lockdep can catch deadlocks e. g. because we
call jbd2_journal_stop() for a synchronous handle with some locks held
which rank below transaction start.Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o -
Currently lockdep map is tracked in each journal handle. To be able to
expand lockdep support to cover also other cases where we depend on
transaction commit and where handle is not available, move lockdep map
into struct journal_s. Since this makes the lockdep map shared for all
handles, we have to use rwsem_acquire_read() for acquisitions now.Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o