13 Oct, 2014
1 commit
-
Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.

Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.

This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
09 Oct, 2014
1 commit
-
total_objects could be 0 and is used as a denominator.

While total_objects is a "long", total_objects == 0 is unlikely to happen for
3.12 and later kernels because 32-bit architectures would not be able to
hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
between 3.1 and 3.11 because total_objects in prune_super() was an "int"
and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.

Signed-off-by: Tetsuo Handa
Reviewed-by: Christoph Hellwig
Cc: stable # 3.1+
Signed-off-by: Al Viro
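
The failure mode described above can be sketched in user space. A count of
exactly (1 << 32) kept in a 32-bit counter truncates to 0, and dividing by it
afterwards would trap. The names scan_share() and truncate_to_32() are
hypothetical, and the zero check is only an assumption about the shape of a
guard, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the share of nr_to_scan proportional to part/total.
 * Returns 0 instead of dividing when total is 0. */
static long scan_share(long nr_to_scan, long part, long total)
{
	if (total == 0)		/* assumed guard against a zero denominator */
		return 0;
	return (long)(((int64_t)nr_to_scan * part) / total);
}

/* Model of a 32-bit counter: only the low 32 bits of the count survive. */
static uint32_t truncate_to_32(uint64_t objects)
{
	return (uint32_t)objects;
}
```

With these helpers, truncate_to_32(1ULL << 32) yields 0, which is exactly the
value that must never reach the division unguarded.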
08 Sep, 2014
1 commit
-
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo
Acked-by: Jan Kara
Acked-by: "David S. Miller"
Cc: x86@kernel.org
Cc: Jens Axboe
Cc: "Theodore Ts'o"
Cc: Alexander Viro
Cc: Andrew Morton
14 Aug, 2014
1 commit
-
Pull quota, reiserfs, UDF updates from Jan Kara:
"Scalability improvements for quota, a few reiserfs fixes, and couple
of misc cleanups (udf, ext2)"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Fix use after free in journal teardown
reiserfs: fix corruption introduced by balance_leaf refactor
udf: avoid redundant memcpy when writing data in ICB
fs/udf: re-use hex_asc_upper_{hi,lo} macros
fs/quota: kernel-doc warning fixes
udf: use linux/uaccess.h
fs/ext2/super.c: Drop memory allocation cast
quota: remove dqptr_sem
quota: simplify remove_inode_dquot_ref()
quota: avoid unnecessary dqget()/dqput() calls
quota: protect Q_GETFMT by dqonoff_mutex
08 Aug, 2014
3 commits
-
These externs belong in fs/internal.h. Rename (they are not acct-specific
anymore) and move them over there.

Signed-off-by: Al Viro
-
just repeat the frozen check after regaining it, and check that sb
is still alive. If several threads hit acct_auto_close() at the
same time, acct_auto_close() will survive that just fine. And we
really don't want to play with writes and closing the file with
->s_umount held exclusive - it's a deadlock country.

Signed-off-by: Al Viro
-
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.

Signed-off-by: Al Viro
16 Jul, 2014
1 commit
-
Remove dqptr_sem to make the quota code scalable: accessing
inode->i_dquot is now protected by dquot_srcu, and changing
inode->i_dquot is now serialized by dq_data_lock.

Signed-off-by: Lai Siyao
Signed-off-by: Niu Yawei
Signed-off-by: Jan Kara
05 Jun, 2014
2 commits
-
We remove the call to grab_super_passive in call to super_cache_count.
This becomes a scalability bottleneck as multiple threads are trying to do
memory reclamation, e.g. when we are doing large amount of file read and
page cache is under pressure. The cached objects quickly got reclaimed
down to 0 and we are aborting the cache_scan() reclaim. But counting
creates a log jam acquiring the sb_lock.

We are holding the shrinker_rwsem which ensures the safety of call to
list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
unregistered now before ->kill_sb() so the operation is safe when we are
doing unmount.

The impact will depend heavily on the machine and the workload but for a
small machine using postmark tuned to use 4xRAM size the results were

                       3.15.0-rc5            3.15.0-rc5
                          vanilla         shrinker-v1r1
Ops/sec Transactions        21.00 (  0.00%)       24.00 ( 14.29%)
Ops/sec FilesCreate         39.00 (  0.00%)       44.00 ( 12.82%)
Ops/sec CreateTransact      10.00 (  0.00%)       12.00 ( 20.00%)
Ops/sec FilesDeleted      6202.00 (  0.00%)     6202.00 (  0.00%)
Ops/sec DeleteTransact      11.00 (  0.00%)       12.00 (  9.09%)
Ops/sec DataRead/MB         25.97 (  0.00%)       29.10 ( 12.05%)
Ops/sec DataWrite/MB        49.99 (  0.00%)       56.02 ( 12.06%)

ffsb running in a configuration that is meant to simulate a mail server showed

                       3.15.0-rc5            3.15.0-rc5
                          vanilla         shrinker-v1r1
Ops/sec readall           9402.63 (  0.00%)     9567.97 (  1.76%)
Ops/sec create            4695.45 (  0.00%)     4735.00 (  0.84%)
Ops/sec delete             173.72 (  0.00%)      179.83 (  3.52%)
Ops/sec Transactions     14271.80 (  0.00%)    14482.81 (  1.48%)
Ops/sec Read                37.00 (  0.00%)       37.60 (  1.62%)
Ops/sec Write               18.20 (  0.00%)       18.30 (  0.55%)

Signed-off-by: Tim Chen
Signed-off-by: Mel Gorman
Cc: Johannes Weiner
Cc: Hugh Dickins
Cc: Dave Chinner
Tested-by: Yuanhan Liu
Cc: Bob Liu
Cc: Jan Kara
Acked-by: Rik van Riel
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This series is aimed at regressions noticed during reclaim activity. The
first two patches are shrinker patches that were posted ages ago but never
merged for reasons that are unclear to me. I'm posting them again to see
if there was a reason they were dropped or if they just got lost. Dave?
Tim? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
retest the vm scalability test cases on a larger machine? Hugh, does this
work for you on the memcg test cases?

Based on ext4, I get the following results but unfortunately my larger
test machines are all unavailable so this is based on a relatively small
machine.

postmark
                         3.15.0-rc5              3.15.0-rc5
                            vanilla         proportion-v1r4
Ops/sec Transactions          21.00 (   0.00%)        25.00 (  19.05%)
Ops/sec FilesCreate           39.00 (   0.00%)        45.00 (  15.38%)
Ops/sec CreateTransact        10.00 (   0.00%)        12.00 (  20.00%)
Ops/sec FilesDeleted        6202.00 (   0.00%)      6202.00 (   0.00%)
Ops/sec DeleteTransact        11.00 (   0.00%)        12.00 (   9.09%)
Ops/sec DataRead/MB           25.97 (   0.00%)        30.02 (  15.59%)
Ops/sec DataWrite/MB          49.99 (   0.00%)        57.78 (  15.58%)

ffsb (mail server simulator)
                         3.15.0-rc5              3.15.0-rc5
                            vanilla         proportion-v1r4
Ops/sec readall             9402.63 (   0.00%)      9805.74 (   4.29%)
Ops/sec create              4695.45 (   0.00%)      4781.39 (   1.83%)
Ops/sec delete               173.72 (   0.00%)       177.23 (   2.02%)
Ops/sec Transactions       14271.80 (   0.00%)     14764.37 (   3.45%)
Ops/sec Read                  37.00 (   0.00%)        38.50 (   4.05%)
Ops/sec Write                 18.20 (   0.00%)        18.50 (   1.65%)

dd of a large file
                         3.15.0-rc5              3.15.0-rc5
                            vanilla         proportion-v1r4
WallTime DownloadTar          75.00 (   0.00%)        61.00 (  18.67%)
WallTime DD                  423.00 (   0.00%)       401.00 (   5.20%)
WallTime Delete                2.00 (   0.00%)         5.00 (-150.00%)

stutter (times mmap latency during large amounts of IO)
                         3.15.0-rc5              3.15.0-rc5
                            vanilla         proportion-v1r4
Unit >5ms Delays         80252.0000 (   0.00%)   81523.0000 (  -1.58%)
Unit Mmap min                8.2118 (   0.00%)       8.3206 (  -1.33%)
Unit Mmap mean              17.4614 (   0.00%)      17.2868 (   1.00%)
Unit Mmap stddev            24.9059 (   0.00%)      34.6771 ( -39.23%)
Unit Mmap max             2811.6433 (   0.00%)    2645.1398 (   5.92%)
Unit Mmap 90%               20.5098 (   0.00%)      18.3105 (  10.72%)
Unit Mmap 93%               22.9180 (   0.00%)      20.1751 (  11.97%)
Unit Mmap 95%               25.2114 (   0.00%)      22.4988 (  10.76%)
Unit Mmap 99%               46.1430 (   0.00%)      43.5952 (   5.52%)
Unit Ideal Tput             85.2623 (   0.00%)      78.8906 (   7.47%)
Unit Tput min               44.0666 (   0.00%)      43.9609 (   0.24%)
Unit Tput mean              45.5646 (   0.00%)      45.2009 (   0.80%)
Unit Tput stddev             0.9318 (   0.00%)       1.1084 ( -18.95%)
Unit Tput max               46.7375 (   0.00%)      46.7539 (  -0.04%)

This patch (of 3):
We would like to unregister the sb shrinker before ->kill_sb(). This will
allow cached objects to be counted without a call to grab_super_passive() to
update the ref count on sb. We want to avoid locking during memory
reclamation, especially when we are skipping the memory reclaim because we are
out of cached objects.

This is safe because grab_super_passive does a try-lock on the
sb->s_umount now, and so if we are in the unmount process, it won't ever
block. That means what used to be a deadlock and races we were avoiding
by using grab_super_passive() is now:

shrinker                              umount
down_read(shrinker_rwsem)
                                      down_write(sb->s_umount)
                                      shrinker_unregister
                                        down_write(shrinker_rwsem)
grab_super_passive(sb)
  down_read_trylock(sb->s_umount)
....
up_read(shrinker_rwsem)
                                        up_write(shrinker_rwsem)
                                      ->kill_sb()
                                      ....

So it is safe to deregister the shrinker before ->kill_sb().

Signed-off-by: Tim Chen
Signed-off-by: Mel Gorman
Cc: Johannes Weiner
Cc: Hugh Dickins
Cc: Dave Chinner
Tested-by: Yuanhan Liu
Cc: Bob Liu
Cc: Jan Kara
Acked-by: Rik van Riel
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Apr, 2014
1 commit
-
Commit 9e30cc9595303b27b48 removed an internal mount. This
has the side-effect that rootfs now has FSID 0. Many
userspace utilities assume that st_dev in struct stat
is never 0, so this change breaks a number of tools in
early userspace.

Since we don't know how many userspace programs are affected,
make sure that FSID is at least 1.

References: http://article.gmane.org/gmane.linux.kernel/1666905
References: http://permalink.gmane.org/gmane.linux.utilities.util-linux-ng/8557
Cc: 3.14
Signed-off-by: Thomas Bächler
Acked-by: Tejun Heo
Acked-by: H. Peter Anvin
Tested-by: Alexandre Demers
Signed-off-by: Greg Kroah-Hartman
13 Mar, 2014
1 commit
-
Previously, the no-op "mount -o remount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs(). This seems pretty stupid, and it's certainly
not documented or guaranteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior. In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o"
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig
Cc: Artem Bityutskiy
Cc: Adrian Hunter
Cc: Evgeniy Dushistov
Cc: Jan Kara
Cc: OGAWA Hirofumi
Cc: Anders Larsen
Cc: Phillip Lougher
Cc: Kees Cook
Cc: Mikulas Patocka
Cc: Petr Vandrovec
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
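
The "only sync when going from read-write to read-only" condition mentioned
above can be sketched as a small predicate. This is a user-space model under
stated assumptions: SKETCH_RDONLY stands in for the kernel's read-only mount
flag, and remount_needs_sync() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_RDONLY 0x1UL	/* stand-in for the read-only mount flag */

/* True only when a remount takes a read-write filesystem to read-only,
 * the one transition where the message says a sync is clearly useful. */
static bool remount_needs_sync(unsigned long cur_flags, unsigned long new_flags)
{
	bool was_ro = (cur_flags & SKETCH_RDONLY) != 0;
	bool want_ro = (new_flags & SKETCH_RDONLY) != 0;

	return !was_ro && want_ro;
}
```

A filesystem that depends on the old behavior would simply sync
unconditionally instead of consulting a predicate like this one.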
01 Feb, 2014
1 commit
-
Move sync_filesystem() after sb_prepare_remount_readonly(). If writers
sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
it can cause inodes to be dirtied and writeback to occur well after
sys_mount() has completed successfully.

This was spotted by corrupted ubifs filesystems on reboot, but it appears
that it can cause issues with any filesystem using writeback.

Cc: Artem Bityutskiy
Cc: Christoph Hellwig
Cc: Alexander Viro
CC: Richard Weinberger
Co-authored-by: Richard Weinberger
Signed-off-by: Andrew Ruder
Signed-off-by: Al Viro
22 Jan, 2014
1 commit
-
On fail path alloc_super() calls destroy_super(), which issues a warning
if the sb's s_mounts list is not empty, in particular if it has not been
initialized. That said s_mounts must be initialized in alloc_super()
before any possible failure, but currently it is initialized close to
the end of the function leading to a useless warning dumped to log if
either percpu_counter_init() or list_lru_init() fails. Let's fix this.

Signed-off-by: Vladimir Davydov
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Nov, 2013
1 commit
-
The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.

Signed-off-by: Al Viro
25 Oct, 2013
2 commits
-
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
02 Oct, 2013
1 commit
-
Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong;
the right place is destroy_super(). As it is, we leak them if sget()
decides that new superblock it has allocated (and never shown to
anybody) isn't needed and should be freed.

Signed-off-by: Al Viro
11 Sep, 2013
8 commits
-
This patch adds the missing call to list_lru_destroy (spotted by Li Zhong)
and moves the deletion to after the shrinker is unregistered, as correctly
spotted by Dave.

Signed-off-by: Glauber Costa
Cc: Michal Hocko
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
We currently use a compile-time constant to size the node array for the
list_lru structure. Due to this, we don't need to allocate any memory at
initialization time. But as a consequence, the structures that contain
embedded list_lru lists can become way too big (the superblock for
instance contains two of them).

This patch aims at ameliorating this situation by dynamically allocating
the node arrays with the firmware provided nr_node_ids.

Signed-off-by: Glauber Costa
Cc: Dave Chinner
Cc: Mel Gorman
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
Now that the shrinker is passing a node in the scan control structure, we
can pass this to the generic LRU list code to isolate reclaim to the
lists on matching nodes.

Signed-off-by: Dave Chinner
Signed-off-by: Glauber Costa
Acked-by: Mel Gorman
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
[glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
Signed-off-by: Dave Chinner
Signed-off-by: Glauber Costa
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
[glommer@openvz.org: adapted for new LRU return codes]
Signed-off-by: Dave Chinner
Signed-off-by: Glauber Costa
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
Convert superblock shrinker to use the new count/scan API, and propagate
the API changes through to the filesystem callouts. The filesystem
callouts already use a count/scan API, so it's just changing counters to
longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted to
the count/scan API. This is mainly a mechanical change.

[glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
Signed-off-by: Dave Chinner
Signed-off-by: Glauber Costa
Acked-by: Mel Gorman
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
With the dentry LRUs being per-sb structures, there is no real need for
a global dentry_lru_lock. The locking can be made more fine-grained by
moving to a per-sb LRU lock, isolating the LRU operations of different
filesystems completely from each other. The need for this is independent
of any performance consideration that may arise: in the interest of
abstracting the lru operations away, it is mandatory that each lru works
around its own lock instead of a global lock for all of them.

[glommer@openvz.org: updated changelog ]
Signed-off-by: Dave Chinner
Signed-off-by: Glauber Costa
Reviewed-by: Christoph Hellwig
Acked-by: Mel Gorman
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
-
The sysctl knob sysctl_vfs_cache_pressure is used to determine which
percentage of the shrinkable objects in our cache we should actively try
to shrink.

It works great in situations in which we have many objects (at least more
than 100), because the approximation errors will be negligible. But if
this is not the case, especially when total_objects < 100, we may end up
concluding that we have no objects at all (total / 100 = 0, if total <
100).

This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.

Signed-off-by: Glauber Costa
Reviewed-by: Carlos Maiolino
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Mel Gorman
Cc: Dave Chinner
Cc: Al Viro
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
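
The rounding problem described above (total / 100 = 0 when total < 100) can
be reproduced in a few lines of user-space C. The sketch below contrasts the
divide-first form with a quotient-plus-remainder form in the spirit of the
mult_frac helper that a related entry in this log mentions. Both function
names are hypothetical, and the second is a plain-function approximation
rather than the kernel macro itself.

```c
#include <assert.h>

/* Divide-first form: any total below 100 collapses to 0. */
static long pressure_naive(long total, long pressure)
{
	return (total / 100) * pressure;
}

/* Scale the quotient and the remainder separately so that small
 * totals are not rounded away before the multiplication. */
static long mult_frac_sketch(long x, long numer, long denom)
{
	assert(denom != 0);

	long quot = x / denom;
	long rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}
```

With 50 objects and a pressure setting of 100, the naive form reports 0
shrinkable objects while the split form reports 50.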
08 Sep, 2013
1 commit
-
Signed-off-by: Al Viro
04 Sep, 2013
1 commit
-
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue. This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.

The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.

Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara. I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.

JK: Fixed ext4 part, dynamic allocation of the workqueue.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara
Signed-off-by: Al Viro
20 Jul, 2013
1 commit
-
Eric Sandeen has found a nasty livelock in sget() - take a mount(2) about
to fail. The superblock is on ->fs_supers, ->s_umount is held exclusive,
->s_active is 1. Along come two more processes, trying to mount the same
thing; sget() in each is picking that superblock, bumping ->s_count and
trying to grab ->s_umount. ->s_active is 3 now. Original mount(2)
finally gets to deactivate_locked_super() on failure; ->s_active is 2,
superblock is still ->fs_supers because shutdown will *not* happen until
->s_active hits 0. ->s_umount is dropped and now we have two processes
chasing each other:
s_active = 2, A acquired ->s_umount, B blocked
A sees that the damn thing is stillborn, does deactivate_locked_super()
s_active = 1, A drops ->s_umount, B gets it
A restarts the search and finds the same superblock. And bumps its ->s_active.
s_active = 2, B holds ->s_umount, A blocked on trying to get it
... and we are in the earlier situation with A and B switched places.

The root cause, of course, is that ->s_active should not grow until we'd
got MS_BORN. Then failing ->mount() will have deactivate_locked_super()
shut the damn thing down. Fortunately, it's easy to do - the key point
is that grab_super() is called only for superblocks currently on ->fs_supers,
so it can bump ->s_count and grab ->s_umount first, then check MS_BORN and
bump ->s_active; we must never increment ->s_count for superblocks past
->kill_sb(), but grab_super() is never called for those.

The bug is pretty old; we would've caught it by now, if not for accidental
exclusion between sget() for block filesystems; the things like cgroup or
e.g. mtd-based filesystems don't have anything of that sort, so they get
bitten. The right way to deal with that is obviously to fix sget()...

Signed-off-by: Al Viro
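
The ordering described above (pin via ->s_count, then check MS_BORN, and only
then bump ->s_active) can be modelled with a toy structure. This is a sketch
under stated assumptions: struct sb_sketch and grab_super_sketch() are
hypothetical names, the "born" field stands in for the MS_BORN flag, and all
locking of ->s_umount and of the superblock list is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy superblock holding only the fields the description mentions. */
struct sb_sketch {
	int s_count;	/* passive references */
	int s_active;	/* active references */
	bool born;	/* stands in for the MS_BORN flag */
};

/* Pin the structure first, then hand out an active reference only if
 * the superblock is born. A stillborn one keeps its s_active unchanged,
 * so the failing mount can still drive it to zero and shut it down. */
static bool grab_super_sketch(struct sb_sketch *sb)
{
	sb->s_count++;			/* temporary pin while we look */

	bool ok = sb->born && sb->s_active > 0;

	if (ok)
		sb->s_active++;
	sb->s_count--;			/* pin no longer needed */
	return ok;
}
```

In the livelock scenario, processes A and B both call this on a stillborn
superblock, get false, and never inflate s_active, so the original mount's
final drop reaches zero.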
28 Feb, 2013
2 commits
-
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
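
The three-argument iterator form discussed in this message can be sketched in
user-space C. Everything here is a hypothetical stand-in (struct hnode, struct
hhead, entry_or_null, for_each_entry_sketch), not the definitions in
linux/list.h, and the macro relies on the GNU C __typeof__ extension.

```c
#include <assert.h>
#include <stddef.h>

struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

/* Map a node pointer back to its containing object, or NULL at list end. */
#define entry_or_null(ptr, type, member) \
	((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-argument form: pos is the only cursor, no separate node pointer. */
#define for_each_entry_sketch(pos, head, member)				\
	for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);	\
	     pos;								\
	     pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item {
	int value;
	struct hnode link;
};

/* Sum every item on the list using the single-cursor iterator. */
static int sum_items(struct hhead *head)
{
	struct item *it;
	int sum = 0;

	for_each_entry_sketch(it, head, link)
		sum += it->value;
	return sum;
}
```

Code that previously read the extra node cursor would instead reach the node
through the entry itself, as in it->link above.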
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
MAX_IDR_MASK is another weirdness in the idr interface. As idr covers
whole positive integer range, it's defined as 0x7fffffff or INT_MAX.

Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
They basically mask off the sign bit and operate on the rest, so if
the caller, by accident, passes in a negative number, the sign bit
will be masked off and the remaining part will be used as if that was
the input, which is worse than crashing.

The constant is visible in idr.h and there are several users in the
kernel.

* drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()

  Basically used to test if adap->nr is a negative number which isn't
  -1 and returns -EINVAL if so. idr_alloc() already has negative
  @start checking (w/ WARN_ON_ONCE), so this can go away.

* drivers/infiniband/core/cm.c:cm_alloc_id()
  drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()

  Used to wrap cyclic @start. Can be replaced with max(next, 0).
  Note that this type of cyclic allocation using idr is buggy. These
  are prone to spurious -ENOSPC failure after the first wraparound.

* fs/super.c:get_anon_bdev()

  The ID allocated from ida is masked off before being tested whether
  it's inside valid range. ida allocated ID can never be a negative
  number and the masking is unnecessary.

Update idr_*() functions to fail with -EINVAL when negative @id is
specified and update other MAX_IDR_MASK users as described above.

This leaves MAX_IDR_MASK without any user, remove it and relocate
other MAX_IDR_* constants to lib/idr.c.

Signed-off-by: Tejun Heo
Cc: Jean Delvare
Cc: Roland Dreier
Cc: Sean Hefty
Cc: Hal Rosenstock
Cc: "Marciniszyn, Mike"
Cc: Jack Morgenstein
Cc: Or Gerlitz
Cc: Al Viro
Acked-by: Wolfram Sang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
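
Why masking off the sign bit is "worse than crashing" can be seen in a small
user-space sketch. The names MASK_SKETCH, id_masked() and id_checked() are
hypothetical, the mask value is the 0x7fffffff given in the text, and a
32-bit int is assumed.

```c
#include <assert.h>
#include <errno.h>

#define MASK_SKETCH 0x7fffffffU	/* same value the text gives for MAX_IDR_MASK */

/* Old behaviour as described: drop the sign bit and carry on, so a
 * negative id silently turns into some unrelated positive id. */
static int id_masked(int id)
{
	return (int)((unsigned int)id & MASK_SKETCH);
}

/* New behaviour as described: reject negative ids outright. */
static int id_checked(int id)
{
	if (id < 0)
		return -EINVAL;
	return id;
}
```

An accidental -1 becomes 0x7fffffff under masking, which a lookup would then
treat as a legitimate id, whereas the checked form reports -EINVAL.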
10 Oct, 2012
1 commit
-
Removed s_lock from super_block and removed lock/unlock super.
Signed-off-by: Marco Stornelli
Signed-off-by: Al Viro
06 Oct, 2012
1 commit
-
To avoid name conflicts:
drivers/video/riva/fbdev.c:281:9: sparse: preprocessor token MAX_LEVEL redefined
While at it, also make the other names more consistent and add
parentheses.

[akpm@linux-foundation.org: repair fallout]
[sfr@canb.auug.org.au: IB/mlx4: fix for MAX_ID_MASK to MAX_IDR_MASK name change]
Signed-off-by: Fengguang Wu
Cc: Bernd Petrovitsch
Cc: walter harms
Cc: Glauber Costa
Signed-off-by: Stephen Rothwell
Cc: Roland Dreier
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Oct, 2012
1 commit
-
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
04 Aug, 2012
1 commit
-
Finally we can kill the 'sync_supers' kernel thread along with the
'->write_super()' superblock operation because all the users are gone.
Now every file-system is supposed to self-manage own superblock and
its dirty state.

The nice thing about killing this thread is that it improves power management.
Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
every 5 seconds no matter what - even if there were no dirty superblocks and
even if there were no file-systems using this service (e.g., btrfs and
journalled ext4 do not need it). So it was wasting power most of the time. And
because the thread was in the core of the kernel, all systems had to have it.
So I am quite happy to make it go away.

Interestingly, this thread is a left-over from the pdflush kernel thread which
was a self-forking kernel thread responsible for all the write-back in old
Linux kernels. It was turned into per-block device BDI threads, and
'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.

Signed-off-by: Artem Bityutskiy
Signed-off-by: Al Viro
02 Aug, 2012
1 commit
-
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.

Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
01 Aug, 2012
1 commit
-
09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a
shrinker callback which was returning -1 when nr_to_scan is zero, which
caused excessive slab scanning. But 635697c6 ("vmscan: fix initial
shrinker size handling") fixed the problem again, so we can freely return
-1 although nr_to_scan is zero. So let's revert 09f363c7 because the
comment added in 09f363c7 made an unnecessary rule.

Signed-off-by: Minchan Kim
Cc: Al Viro
Cc: Mikulas Patocka
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Jul, 2012
3 commits
-
Now that all users are converted, we can remove functions, variables, and
constants defined by the old freezing mechanism.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa
Tested-by: Peter M. Petrakis
Tested-by: Dann Frazier
Tested-by: Massimo Morana
Signed-off-by: Jan Kara
Signed-off-by: Al Viro
-
vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.

Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.

This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are three such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite), one for all other writers, and one for
internal filesystem purposes (used e.g. to track running transactions). Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.

Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.

[AV: microoptimize sb_start_write(); we want it fast in normal case]
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa
Tested-by: Peter M. Petrakis
Tested-by: Dann Frazier
Tested-by: Massimo Morana
Signed-off-by: Jan Kara
Signed-off-by: Al Viro
-
Pull writeback updates from Wu Fengguang:
"Use time based periods to age the writeback proportions, which can
adapt equally well to fast/slow devices."

Fix up trivial conflict in comment in fs/sync.c

* tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix some comment errors
block: Convert BDI proportion calculations to flexible proportions
lib: Fix possible deadlock in flexible proportion code
lib: Proportions with flexible period