17 Dec, 2018
1 commit
-
[ Upstream commit c5a94f434c82529afda290df3235e4d85873c5b4 ]
It was observed that a process blocked indefinitely in
__fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
to be cleared via fscache_wait_for_deferred_lookup().

At this time, ->backing_objects was empty, which would normally prevent
__fscache_read_or_alloc_page() from getting to the point of waiting.
This implies that ->backing_objects was cleared *after*
__fscache_read_or_alloc_page() was entered.

When an object is "killed" and then "dropped",
FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
->backing_objects cleared. This leaves a window where
something else can set FSCACHE_COOKIE_LOOKING_UP and
__fscache_read_or_alloc_page() can start waiting, before
->backing_objects is cleared.

There is some uncertainty in this analysis, but it seems to fit the
observations. Adding the wake in this patch will be handled correctly
by __fscache_read_or_alloc_page(), as it checks whether ->backing_objects
is empty again after waiting.

The customer who reported the hang also reports that the hang cannot be
reproduced with this fix.

The backtrace for the blocked process looked like:
PID: 29360 TASK: ffff881ff2ac0f80 CPU: 3 COMMAND: "zsh"
#0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
#1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
#2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
#3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
#4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
#5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
#6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
#7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
#8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
#9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
#10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
#11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
#12 [ffff881ff43eff18] sys_read at ffffffff811fda62
#13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e

Signed-off-by: NeilBrown
Signed-off-by: David Howells
Signed-off-by: Sasha Levin
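The suspected race can be illustrated with a minimal userspace sketch. It replays the interleaving deterministically instead of using real threads. All `sketch_*` names are hypothetical, and plain fields stand in for FSCACHE_COOKIE_LOOKING_UP and ->backing_objects. The sketch also assumes that the added wake-up happens in the drop step, immediately after ->backing_objects is emptied.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the cookie state involved in the hang. */
struct sketch_cookie {
	bool looking_up;        /* models FSCACHE_COOKIE_LOOKING_UP */
	int  n_backing_objects; /* models the length of ->backing_objects */
};

/* Hypothetical model of the reader sleeping on the LOOKING_UP bit. */
struct sketch_waiter {
	bool waiting;
	bool woken;
};

static void sketch_wake_looking_up(struct sketch_waiter *w)
{
	if (w->waiting)
		w->woken = true;
}

/* DROP_OBJECT: the point where ->backing_objects finally becomes empty. */
static void sketch_drop_object(struct sketch_cookie *c, struct sketch_waiter *w,
			       bool wake_after_clear)
{
	c->n_backing_objects = 0;
	if (wake_after_clear)
		sketch_wake_looking_up(w); /* models the wake-up the patch adds */
}

/*
 * Replay the suspected interleaving step by step (no real threads).
 * Returns -ENOBUFS if the reader bails out cleanly, or 1 if it would still
 * be asleep with nobody left to wake it.
 */
static int sketch_replay(bool wake_after_clear)
{
	struct sketch_cookie c = { .looking_up = false, .n_backing_objects = 1 };
	struct sketch_waiter w = { .waiting = false, .woken = false };

	/* LOOKING_UP was already cleared by the lookup failure, but the
	 * reader enters while a backing object is still listed... */
	if (c.n_backing_objects == 0)
		return -ENOBUFS;

	/* ...something else sets LOOKING_UP again, so the reader sleeps. */
	c.looking_up = true;
	w.waiting = true;

	sketch_drop_object(&c, &w, wake_after_clear);

	if (!w.woken)
		return 1; /* hung, as in the backtrace above */

	/* After waking, the reader re-checks ->backing_objects. */
	return c.n_backing_objects == 0 ? -ENOBUFS : 0;
}
```

Without the wake, the reader is stranded. With it, the re-check after waking makes the reader give up cleanly.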
05 Sep, 2018
1 commit
-
[ Upstream commit d0eb06afe712b7b103b6361f40a9a0c638524669 ]
Alter the state-check assertion in fscache_enqueue_operation() to allow
cancelled operations to be given processing time so they can be cleaned up.

Also fix a debugging statement that was requiring such operations to have
an object assigned.

Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Kiran Kumar Modukuri
Signed-off-by: David Howells
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
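Below is a minimal sketch of the relaxed state check and the NULL-safe debug output. The state names are hypothetical and are only loosely modelled on the states mentioned in these commit messages. The sketch does not reproduce the kernel's actual assertion macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation states. */
enum sketch_op_state {
	SK_OP_INITIALISED,
	SK_OP_PENDING,
	SK_OP_IN_PROGRESS,
	SK_OP_COMPLETE,
	SK_OP_CANCELLED,
	SK_OP_DEAD,
};

struct sketch_object { unsigned int debug_id; };

struct sketch_op {
	enum sketch_op_state state;
	const struct sketch_object *object; /* may be NULL for a cancelled op */
};

/* Before: only an in-progress op could be handed to the workqueue. */
static bool sketch_enqueue_ok_old(enum sketch_op_state s)
{
	return s == SK_OP_IN_PROGRESS;
}

/* After: a cancelled op is also accepted so that its cleanup can run. */
static bool sketch_enqueue_ok_new(enum sketch_op_state s)
{
	return s == SK_OP_IN_PROGRESS || s == SK_OP_CANCELLED;
}

/* Debug output that no longer assumes an object has been assigned. */
static unsigned int sketch_debug_object_id(const struct sketch_op *op)
{
	return op->object ? op->object->debug_id : 0;
}

static unsigned int sketch_debug_cancelled_op(void)
{
	struct sketch_op op = { SK_OP_CANCELLED, NULL };

	return sketch_debug_object_id(&op);
}
```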
30 May, 2018
1 commit
-
[ Upstream commit 2c98425720233ae3e135add0c7e869b32913502f ]
If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit,
then it needs to wake the page, as someone may be waiting on completion of
the write.

The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.

Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.

Signed-off-by: David Howells
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
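The fix amounts to releasing over-limit pages instead of leaving them marked. The loop below is a hypothetical sketch: `sketch_page` and its flags stand in for a page carrying a "being stored" mark that other tasks wait on. This is not fscache_write_op() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page with a "being stored to cache" mark that others wait on. */
struct sketch_page {
	unsigned long index;
	bool store_pending;
	bool stored;
};

/*
 * Model of the write loop: pages below store_limit are stored; pages at or
 * beyond it are skipped, but their pending mark is still cleared ("marked as
 * written") so that anyone waiting for the write is released.
 */
static size_t sketch_write_op(struct sketch_page *pages, size_t n,
			      unsigned long store_limit)
{
	size_t stored = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].store_pending)
			continue;
		if (pages[i].index < store_limit) {
			pages[i].stored = true;
			stored++;
		}
		pages[i].store_pending = false; /* wake waiters either way */
	}
	return stored;
}

static bool sketch_any_pending(const struct sketch_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (pages[i].store_pending)
			return true;
	return false;
}

/* Three pending pages, but a limit of 2 only admits indices 0 and 1. */
static struct sketch_page sketch_pages[3] = {
	{ 0, true, false }, { 1, true, false }, { 2, true, false },
};
```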
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
13 Oct, 2017
1 commit
-
When the file /proc/fs/fscache/objects (available with
CONFIG_FSCACHE_OBJECT_LIST=y) is opened, we request a user key with
description "fscache:objlist", then access its payload. However, a
revoked key has a NULL payload, and we failed to check for this.
request_key() *does* skip revoked keys, but there is still a window
where the key can be revoked before we access its payload.

Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.

Fixes: 4fbf4291aa15 ("FS-Cache: Allow the current state of all objects to be dumped")
Reviewed-by: James Morris
Cc: [v2.6.32+]
Signed-off-by: Eric Biggers
Signed-off-by: David Howells
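The following sketch shows a NULL-payload check that treats a revoked-after-lookup key the same as a key that was never obtained. All names are hypothetical. It assumes that "already revoked" means the caller falls back to a "no configuration" default, and it omits the RCU protection that the real payload access needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical payload carrying an object-list filter configuration. */
struct sketch_objlist_payload {
	unsigned long filter;
};

/* Hypothetical key: a revoked key has a NULL payload. */
struct sketch_key {
	const struct sketch_objlist_payload *payload;
};

#define SKETCH_NO_FILTER 0UL /* assumed "no configuration" default */

/*
 * key == NULL models request_key() not returning a key (e.g. it skipped a
 * revoked one); payload == NULL models the key being revoked after it was
 * returned. Both are handled the same way.
 */
static unsigned long sketch_objlist_config(const struct sketch_key *key)
{
	const struct sketch_objlist_payload *confkey;

	if (!key)
		return SKETCH_NO_FILTER;
	confkey = key->payload;
	if (!confkey)
		return SKETCH_NO_FILTER; /* the check this fix adds */
	return confkey->filter;
}

static const struct sketch_objlist_payload sketch_cfg = { 0x5UL };
static const struct sketch_key sketch_live_key = { &sketch_cfg };
static const struct sketch_key sketch_revoked_key = { NULL };
```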
14 Sep, 2017
1 commit
-
gcc points out a minor bug in the handling of unknown cookie types,
which could result in a string overflow when the integer is copied into
a 3-byte string:

fs/fscache/object-list.c: In function 'fscache_objlist_show':
fs/fscache/object-list.c:265:19: error: 'sprintf' may write a terminating nul past the end of the destination [-Werror=format-overflow=]
sprintf(_type, "%02u", cookie->def->type);
^~~~~~
fs/fscache/object-list.c:265:4: note: 'sprintf' output between 3 and 4 bytes into a destination of size 3

This is currently harmless as no code sets a type other than 0 or 1, but
it makes sense to use snprintf() here to avoid overflowing the array if
that changes.

Link: http://lkml.kernel.org/r/20170714120720.906842-22-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
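This is a small sketch of the bounded formatting. The helper names are hypothetical, and the sketch bounds the call by the buffer's size. With a 3-byte field, a three-digit type is truncated instead of writing a NUL past the end.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a cookie type into a caller-supplied field. With sprintf(), a value
 * of 100 or more needs 4 bytes (three digits plus the NUL) and would overflow
 * a 3-byte array; snprintf() bounded by the buffer size truncates instead.
 */
static void sketch_format_type(char *buf, size_t size, unsigned int type)
{
	snprintf(buf, size, "%02u", type);
}

/* Returns 1 if formatting @type into a 3-byte array yields @expect. */
static int sketch_type_matches(unsigned int type, const char *expect)
{
	char _type[3];

	sketch_format_type(_type, sizeof(_type), type);
	return strcmp(_type, expect) == 0;
}
```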
07 Sep, 2017
2 commits
-
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.

Just drop the argument.
Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue. Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
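The caller-side benefit of having the lookup advance the index can be shown with a hypothetical stand-in. It is not pagevec_lookup() and has a different signature. The "page cache" here is just a sorted array of present indices, and the batch size of 4 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGEVEC_SIZE 4 /* arbitrary batch size for the sketch */

/* Hypothetical sparse "page cache": sorted indices of present pages. */
static const unsigned long sketch_cache[] = { 0, 1, 5, 6, 7, 9, 12 };
#define SKETCH_NCACHE (sizeof(sketch_cache) / sizeof(sketch_cache[0]))

/* Gather up to SKETCH_PAGEVEC_SIZE present pages starting at *start and
 * advance *start past the last page returned. */
static size_t sketch_pagevec_lookup(unsigned long *out, unsigned long *start)
{
	size_t nr = 0;

	for (size_t i = 0; i < SKETCH_NCACHE && nr < SKETCH_PAGEVEC_SIZE; i++)
		if (sketch_cache[i] >= *start)
			out[nr++] = sketch_cache[i];
	if (nr)
		*start = out[nr - 1] + 1;
	return nr;
}

/* Because the index is updated for us, the caller's loop stays trivial. */
static size_t sketch_count_all(void)
{
	unsigned long batch[SKETCH_PAGEVEC_SIZE];
	unsigned long index = 0;
	size_t total = 0, nr;

	while ((nr = sketch_pagevec_lookup(batch, &index)) != 0)
		total += nr;
	return total;
}

static unsigned long sketch_first_batch_next_index(void)
{
	unsigned long batch[SKETCH_PAGEVEC_SIZE];
	unsigned long index = 0;

	(void)sketch_pagevec_lookup(batch, &index); /* returns 0, 1, 5, 6 */
	return index;
}
```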
02 Mar, 2017
1 commit
-
rcu_dereference_key() and user_key_payload() are currently being used in
two different, incompatible ways:

(1) As a wrapper to rcu_dereference() - when only the RCU read lock is used
    to protect the key.

(2) As a wrapper to rcu_dereference_protected() - when the key semaphore is
    used to protect the key and the key may be being modified.

Fix this by splitting both of the key wrappers to produce:

(1) RCU accessors for keys when the caller has the key semaphore locked:

    dereference_key_locked()
    user_key_payload_locked()

(2) RCU accessors for keys when the caller holds the RCU read lock:

    dereference_key_rcu()
    user_key_payload_rcu()

This should fix the following warning in the NFS idmapper:
===============================
[ INFO: suspicious RCU usage. ]
4.10.0 #1 Tainted: G W
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
1 lock held by mount.nfs/5987:
#0: (rcu_read_lock){......}, at: [] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
stack backtrace:
CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1
Call Trace:
dump_stack+0xe8/0x154 (unreliable)
lockdep_rcu_suspicious+0x140/0x190
nfs_idmap_get_key+0x380/0x420 [nfsv4]
nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
call_decode+0x29c/0x910 [sunrpc]
__rpc_execute+0x140/0x8f0 [sunrpc]
rpc_run_task+0x170/0x200 [sunrpc]
nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
_nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
nfs4_lookup_root+0xe0/0x350 [nfsv4]
nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
nfs4_find_root_sec+0xc4/0x100 [nfsv4]
nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
nfs4_get_rootfh+0x6c/0x190 [nfsv4]
nfs4_server_common_setup+0xc4/0x260 [nfsv4]
nfs4_create_server+0x278/0x3c0 [nfsv4]
nfs4_remote_mount+0x50/0xb0 [nfsv4]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
nfs_do_root_mount+0xb0/0x140 [nfsv4]
nfs4_try_mount+0x60/0x100 [nfsv4]
nfs_fs_mount+0x5ec/0xda0 [nfs]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
do_mount+0x254/0xf70
SyS_mount+0x94/0x100
system_call+0x38/0xe0

Reported-by: Jan Stancek
Signed-off-by: David Howells
Tested-by: Jan Stancek
Signed-off-by: James Morris
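The next sketch shows why one accessor cannot serve both protection schemes. It is a userspace model in which a counter stands in for lockdep's "suspicious RCU usage" report. The `sketch_*` names are hypothetical. The only assumption about the old accessor is that it checked for the key semaphore, as the warning above suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of which protection the caller currently holds. */
struct sketch_ctx {
	bool rcu_read_locked; /* inside an RCU read-side critical section */
	bool key_sem_held;    /* holding the key's semaphore */
};

struct sketch_key { const char *payload; };

static int sketch_suspicious; /* counts "suspicious RCU usage" reports */

static void sketch_lockdep_check(bool ok)
{
	if (!ok)
		sketch_suspicious++;
}

/* For callers that hold the key semaphore (rcu_dereference_protected() style). */
static const char *sketch_user_key_payload_locked(const struct sketch_key *key,
						  const struct sketch_ctx *ctx)
{
	sketch_lockdep_check(ctx->key_sem_held);
	return key->payload;
}

/* For callers that hold only the RCU read lock (rcu_dereference() style). */
static const char *sketch_user_key_payload_rcu(const struct sketch_key *key,
					       const struct sketch_ctx *ctx)
{
	sketch_lockdep_check(ctx->rcu_read_locked);
	return key->payload;
}

/* The NFS idmapper holds only rcu_read_lock(); count reports per accessor. */
static int sketch_idmapper_reports(bool use_rcu_accessor)
{
	const struct sketch_key key = { "uid:1000" };
	const struct sketch_ctx ctx = { .rcu_read_locked = true, .key_sem_held = false };

	sketch_suspicious = 0;
	if (use_rcu_accessor)
		(void)sketch_user_key_payload_rcu(&key, &ctx);
	else
		(void)sketch_user_key_payload_locked(&key, &ctx);
	return sketch_suspicious;
}
```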
01 Feb, 2017
3 commits
-
Under some circumstances, an fscache object can become queued such that
fscache_object_work_func() can be called once the object is in the
OBJECT_DEAD state. This results in the kernel oopsing when it tries to
invoke the handler for the state (which is hard coded to 0x2).

The way this comes about is something like the following:

 (1) The object dispatcher is processing a work state for an object. This
     is done in workqueue context.

 (2) An out-of-band event comes in that isn't masked, causing the object to
     be queued, say EV_KILL.

 (3) The object dispatcher finishes processing the current work state on
     that object and then sees there's another event to process, so,
     without returning to the workqueue core, it processes that event too.
     It then follows the chain of events that this initiates until we reach
     OBJECT_DEAD without going through a wait state (such as
     WAIT_FOR_CLEARANCE).

     At this point, object->events may be 0, object->event_mask will be 0
     and oob_event_mask will be 0.

 (4) The object dispatcher returns to the workqueue processor, and in due
     course, this sees that the object's work item is still queued and
     invokes it again.

 (5) The current state is a work state (OBJECT_DEAD), so the dispatcher
     jumps to it - resulting in an OOPS.

When I'm seeing this, the work state in (1) appears to have been either
LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
fscache_osm_lookup_oob).

The window for (2) is very small:

 (A) object->event_mask is cleared whilst the event dispatch process is
     underway - though there's no memory barrier to force this to the top
     of the function.

     The window, therefore, is from the time the object was selected by the
     workqueue processor and made requeueable to the time the mask was
     cleared.

 (B) fscache_raise_event() will only queue the object if it manages to set
     the event bit and the corresponding event_mask bit was set.

     The enqueuement is then deferred slightly whilst we get a ref on the
     object and get the per-CPU variable for workqueue congestion. This
     slight deferral slightly increases the probability by allowing extra
     time for the workqueue to make the item requeueable.

Handle this by giving the dead state a processor function and checking
for the dead state address rather than seeing if the processor function is
address 0x2. The dead state processor function can then set a flag to
indicate that it's occurred and give a warning if it occurs more than once
per object.

If this race occurs, an oops similar to the following is seen (note the RIP
value):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
IP: [] 0x1
PGD 0
Oops: 0010 [#1] SMP
Modules linked in: ...
CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
RIP: 0010:[] [] 0x1
RSP: 0018:ffff880717547df8 EFLAGS: 00010202
RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
Call Trace:
[] ? fscache_object_work_func+0xa5/0x200 [fscache]
[] process_one_work+0x17b/0x470
[] worker_thread+0x21c/0x400
[] ? rescuer_thread+0x400/0x400
[] kthread+0xcf/0xe0
[] ? kthread_create_on_node+0x140/0x140
[] ret_from_fork+0x58/0x90
[] ? kthread_create_on_node+0x140/0x140

Signed-off-by: David Howells
Acked-by: Jeremy McNicoll
Tested-by: Frank Sorenson
Tested-by: Benjamin Coddington
Reviewed-by: Benjamin Coddington
Signed-off-by: Al Viro -
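This is a minimal sketch of the handling described in the commit. The dead state gets a real processor function, the dispatcher recognises it by address, and a repeat visit produces a warning instead of a jump to 0x2. The state table, names, and warning text are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct sketch_object;

/* A state is a name plus a processor function returning the next state. */
struct sketch_state {
	const char *name;
	const struct sketch_state *(*work)(struct sketch_object *obj);
};

struct sketch_object {
	const struct sketch_state *state;
	bool dead_seen;     /* set on the first pass through the dead state */
	int  dead_warnings; /* later passes, which now only warn */
};

static const struct sketch_state *sketch_dead(struct sketch_object *obj);

/* The dead state now has a real processor instead of a 0x2 placeholder. */
static const struct sketch_state sketch_dead_state = { "OBJECT_DEAD", sketch_dead };

static const struct sketch_state *sketch_dead(struct sketch_object *obj)
{
	if (obj->dead_seen) {
		obj->dead_warnings++;
		fprintf(stderr, "%s entered more than once\n", obj->state->name);
	}
	obj->dead_seen = true;
	return &sketch_dead_state; /* remain dead */
}

/* Work-item body: identify the dead state by its address. */
static void sketch_object_work_func(struct sketch_object *obj)
{
	if (obj->state == &sketch_dead_state) {
		(void)sketch_dead(obj);
		return;
	}
	obj->state = obj->state->work(obj);
}

/* A live state whose processing leads straight to death. */
static const struct sketch_state *sketch_kill(struct sketch_object *obj)
{
	(void)obj;
	return &sketch_dead_state;
}

static const struct sketch_state sketch_kill_state = { "KILL_OBJECT", sketch_kill };

/* Simulate a spurious requeue after the object reached OBJECT_DEAD. */
static int sketch_spurious_requeue(void)
{
	struct sketch_object obj = { &sketch_kill_state, false, 0 };

	sketch_object_work_func(&obj); /* KILL_OBJECT -> OBJECT_DEAD */
	sketch_object_work_func(&obj); /* first, expected dead processing */
	sketch_object_work_func(&obj); /* spurious: warns instead of oopsing */
	return obj.dead_warnings;
}
```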
fscache_disable_cookie() needs to clear the outstanding writes on the
cookie it's disabling because they cannot be completed afterwards.

Without this, nfs_fscache_open_file() gets stuck because it disables the
cookie when the file is opened for writing but can't uncache the pages till
afterwards - otherwise there's a race between the open routine and anyone
who already has it open R/O and is still reading from it.

Looking in /proc/pid/stack of the offending process shows:
[] __fscache_wait_on_page_write+0x82/0x9b [fscache]
[] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache]
[] nfs_fscache_open_file+0x59/0x9e [nfs]
[] nfs4_file_open+0x17f/0x1b8 [nfsv4]
[] do_dentry_open+0x16d/0x2b7
[] vfs_open+0x5c/0x65
[] path_openat+0x785/0x8fb
[] do_filp_open+0x48/0x9e
[] do_sys_open+0x13b/0x1cb
[] SyS_open+0x19/0x1b
[] do_syscall_64+0x80/0x17a
[] return_from_SYSCALL_64+0x0/0x7a
[] 0xffffffffffffffff

Reported-by: Jianhong Yin
Signed-off-by: David Howells
Acked-by: Jeff Layton
Acked-by: Steve Dickson
Signed-off-by: Al Viro -
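The sketch below shows why disabling the cookie has to discard outstanding writes. The open-for-write path disables the cookie and then waits on each page's write. Any mark left behind can never be cleared. The cookie layout and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PAGES 8

/* Hypothetical cookie with a set of pages whose cache writes are outstanding. */
struct sketch_cookie {
	bool enabled;
	bool write_pending[SKETCH_MAX_PAGES];
};

/* Waiting on a page's write only returns if something clears the mark. */
static bool sketch_wait_would_block(const struct sketch_cookie *c, size_t idx)
{
	return c->write_pending[idx];
}

/* Disabling the cookie also discards outstanding writes: they can never
 * complete afterwards, so leaving them marked would strand any waiter. */
static void sketch_disable_cookie(struct sketch_cookie *c, bool clear_writes)
{
	c->enabled = false;
	if (clear_writes)
		for (size_t i = 0; i < SKETCH_MAX_PAGES; i++)
			c->write_pending[i] = false;
}

/* Model of the open-for-write path: disable, then uncache every page.
 * Returns how many pages the uncache step would block on forever. */
static size_t sketch_open_for_write_blocked_pages(bool clear_writes)
{
	struct sketch_cookie c = { .enabled = true };
	size_t blocked = 0;

	c.write_pending[2] = true;
	c.write_pending[5] = true;
	sketch_disable_cookie(&c, clear_writes);
	for (size_t i = 0; i < SKETCH_MAX_PAGES; i++)
		if (sketch_wait_would_block(&c, i))
			blocked++;
	return blocked;
}
```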
Initialise the stores_lock in fscache netfs cookies. Technically, it
shouldn't be necessary, since the netfs cookie is an index and stores no
data, but initialising it anyway adds insignificant overhead.

Signed-off-by: David Howells
Reviewed-by: Jeff Layton
Acked-by: Steve Dickson
Signed-off-by: Al Viro
01 Jul, 2016
1 commit
01 Jun, 2016
1 commit
-
Signed-off-by: Yan, Zheng
Acked-by: David Howells
30 May, 2016
1 commit
-
it's not needed for file_operations of inodes located on fs defined
in the hosting module and for file_operations that go into procfs.

Signed-off-by: Al Viro
05 Apr, 2016
1 commit
-
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it will be possible to implement the page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And it is unlikely to.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
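The first two rules of the semantic patch are safe only because the two shifts are equal. The sketch below checks that equivalence using hypothetical `SKETCH_*` constants rather than the kernel's definitions. When the shifts are equal, the shift amount is zero, so `E << 0` is just `E`.

```c
#include <assert.h>

/* Hypothetical values; the only point is that the two shifts are equal,
 * which is the situation the commit describes. */
#define SKETCH_PAGE_SHIFT       12
#define SKETCH_PAGE_CACHE_SHIFT SKETCH_PAGE_SHIFT

/* Pre-conversion form: "E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)". */
static unsigned long sketch_old_form(unsigned long e)
{
	return e << (SKETCH_PAGE_CACHE_SHIFT - SKETCH_PAGE_SHIFT);
}

/* Post-conversion form produced by the semantic patch: plain "E". */
static unsigned long sketch_new_form(unsigned long e)
{
	return e;
}

/* Byte offset of a page index, written with PAGE_SHIFT after the rename. */
static unsigned long sketch_page_offset(unsigned long index)
{
	return index << SKETCH_PAGE_SHIFT;
}
```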
11 Nov, 2015
3 commits
-
Handle a write being requested to the page immediately beyond the EOF
marker on a cache object. Currently this gets an assertion failure in
CacheFiles because the EOF marker is used there to encode information about
a partial page at the EOF - which could lead to an unknown blank spot in
the file if we extend the file over it.

The problem is actually in fscache where we check the index of the page
being written against store_limit. store_limit is set to the number of
pages that we're allowed to store by fscache_set_store_limit() - which
means it's one more than the index of the last page we're allowed to store.
The problem is that we permit writing to a page with an index _equal_ to
the store limit - when we should reject that case.

Whilst we're at it, change the triggered assertion in CacheFiles to just
return -ENOBUFS instead.

The assertion failure looks something like this:

CacheFiles: Assertion failed
1000 < 7b1 is false
------------[ cut here ]------------
kernel BUG at fs/cachefiles/rdwr.c:962!
...
RIP: 0010:[] [] cachefiles_write_page+0x273/0x2d0 [cachefiles]

Cc: stable@vger.kernel.org # v2.6.31+; earlier - that + backport of a17754f (at least)
Signed-off-by: David Howells
Signed-off-by: Al Viro -
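The off-by-one comes down to `<=` versus `<` against a page count. The sketch uses hypothetical helper names. For brevity it folds the CacheFiles-side "-ENOBUFS instead of asserting" change into the same place, whereas the commit makes that change in CacheFiles.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * store_limit is a page count: pages 0 .. store_limit - 1 may be stored, so
 * it is one more than the index of the last storable page.
 */
static bool sketch_may_store_buggy(unsigned long index, unsigned long store_limit)
{
	return index <= store_limit; /* wrongly admits index == store_limit */
}

static bool sketch_may_store_fixed(unsigned long index, unsigned long store_limit)
{
	return index < store_limit;
}

/* Reject an out-of-range write with -ENOBUFS rather than asserting. */
static int sketch_write_page(unsigned long index, unsigned long store_limit)
{
	return sketch_may_store_fixed(index, store_limit) ? 0 : -ENOBUFS;
}
```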
Only override netfs->primary_index when registration succeeds.
Cc: stable@vger.kernel.org # v2.6.30+
Signed-off-by: Kinglong Mee
Signed-off-by: David Howells
Signed-off-by: Al Viro -
If the netfs already exists, fscache should not increase the parent's
usage count and n_children; otherwise, they will never be decreased.

v2: thanks to David's suggestion,
 move the increase of the parent's reference count to the success path
 use kmem_cache_free() to free primary_index directly

v3: don't move "netfs->primary_index->parent = &fscache_fsdef_index;"
Cc: stable@vger.kernel.org # v2.6.30+
Signed-off-by: Kinglong Mee
Signed-off-by: David Howells
Signed-off-by: Al Viro
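Both registration fixes amount to taking references and publishing state only after the duplicate check passes. This is a hypothetical model: the registry array, `-EEXIST` as the duplicate error, and the caller-owned candidate index (which the caller would free on failure) are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct sketch_index {
	int usage;
	int n_children;
};

struct sketch_netfs {
	const char *name;
	struct sketch_index *primary_index;
};

#define SKETCH_MAX_NETFS 4

/* Hypothetical parent index and registry. */
static struct sketch_index sketch_fsdef_index = { 1, 0 };
static const char *sketch_registered[SKETCH_MAX_NETFS];
static size_t sketch_nregistered;

/*
 * Check for a duplicate before touching the parent's counts, and only
 * publish primary_index once registration has succeeded.
 */
static int sketch_register_netfs(struct sketch_netfs *netfs,
				 struct sketch_index *candidate)
{
	for (size_t i = 0; i < sketch_nregistered; i++)
		if (strcmp(sketch_registered[i], netfs->name) == 0)
			return -EEXIST; /* nothing taken, nothing to undo */
	if (sketch_nregistered == SKETCH_MAX_NETFS)
		return -ENOMEM;

	sketch_fsdef_index.usage++;
	sketch_fsdef_index.n_children++;
	sketch_registered[sketch_nregistered++] = netfs->name;
	netfs->primary_index = candidate; /* only on success */
	return 0;
}

static struct sketch_netfs sketch_nfs_a = { "nfs", NULL };
static struct sketch_netfs sketch_nfs_b = { "nfs", NULL };
static struct sketch_index sketch_idx_a, sketch_idx_b;
```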
07 Nov, 2015
1 commit
-
…d avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
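The "false positive" hazard can be illustrated with a sketch that uses hypothetical `SK_GFP_*` bits. Their values are illustrative and are not the kernel's. The blocking test is modelled on gfpflags_allow_blocking(), which checks only the direct-reclaim bit. The composition of an atomic request (high priority, atomic, wakes kswapd) is an assumption based on the description above.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sk_gfp_t;

/* Illustrative bit values only; these are NOT the kernel's __GFP_* values. */
#define SK_GFP_HIGH           0x1u
#define SK_GFP_ATOMIC         0x2u
#define SK_GFP_DIRECT_RECLAIM 0x4u
#define SK_GFP_KSWAPD_RECLAIM 0x8u

/* __GFP_WAIT as redefined: willing to direct-reclaim AND wake kswapd. */
#define SK_GFP_WAIT (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Assumed atomic request: high priority, truly atomic, still wakes kswapd. */
#define SK_GFP_ATOMIC_REQ (SK_GFP_HIGH | SK_GFP_ATOMIC | SK_GFP_KSWAPD_RECLAIM)

/* Mempool-backed caller (e.g. bio): no direct reclaim, kswapd still woken. */
#define SK_GFP_MEMPOOL_REQ SK_GFP_KSWAPD_RECLAIM

/* Modelled on gfpflags_allow_blocking(): only direct reclaim implies blocking. */
static bool sk_gfpflags_allow_blocking(sk_gfp_t gfp)
{
	return (gfp & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* The historical "any __GFP_WAIT bit set" test, now prone to false positives. */
static bool sk_old_style_may_block(sk_gfp_t gfp)
{
	return (gfp & SK_GFP_WAIT) != 0;
}

static bool sk_wakes_kswapd(sk_gfp_t gfp)
{
	return (gfp & SK_GFP_KSWAPD_RECLAIM) != 0;
}
```

An atomic request never blocks, yet it shares the kswapd bit with the redefined __GFP_WAIT. That is why the old-style test misreports it.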
21 Oct, 2015
1 commit
-
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells
cc: linux-cifs@vger.kernel.org
cc: ecryptfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-devel@lists.sourceforge.net
02 Apr, 2015
12 commits
-
Now that the retrieval operation may be disposed of by fscache_put_operation()
before we actually set the context, the retrieval-specific cleanup operation
can produce a NULL-pointer dereference when it tries to unconditionally clean
up the netfs context.

Given that it is expected that we'll get at least as far as the place where we
currently set the context pointer and it is unlikely we'll go through the
error handling paths prior to that point, retain the context right from the
point that the retrieval op is allocated.

Concomitant to this, we need to retain the cookie pointer in the retrieval op
also so that we can call the netfs to release its context in the release
method.

In addition, we might now get into fscache_release_retrieval_op() with the op
only initialised. To this end, set the operation to DEAD only after the
release method has been called and skip the n_pages test upon cleanup if the
op is still in the INITIALISED state.

Without these changes, the following oops might be seen:
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
...
RIP: 0010:[] fscache_release_retrieval_op+0xae/0x100
...
Call Trace:
[] fscache_put_operation+0x117/0x2e0
[] __fscache_read_or_alloc_pages+0x351/0x3ac
[] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
[] nfs_readpages+0x10c/0x185 [nfs]
[] ? alloc_pages_current+0x119/0x13e
[] ? __page_cache_alloc+0xfb/0x10a
[] __do_page_cache_readahead+0x188/0x22c
[] ondemand_readahead+0x29e/0x2af
[] page_cache_sync_readahead+0x38/0x3a
[] generic_file_read_iter+0x1a2/0x55a
[] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
[] nfs_file_read+0x49/0x70 [nfs]
[] new_sync_read+0x78/0x9c
[] __vfs_read+0x13/0x38
[] vfs_read+0x95/0x121
[] SyS_read+0x4c/0x8a
[] system_call_fastpath+0x12/0x17

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
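This minimal sketch models the reordered cleanup. The cookie and context are retained at allocation time, the n_pages check is skipped for an op still in INITIALISED, and DEAD is set only after the release method runs. All types and names are hypothetical, and a counter stands in for the netfs "release context" callback.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_op_state { SK_ST_INITIALISED, SK_ST_PENDING, SK_ST_IN_PROGRESS,
		       SK_ST_COMPLETE, SK_ST_CANCELLED, SK_ST_DEAD };

struct sketch_cookie { int contexts_released; };

struct sketch_retrieval {
	enum sketch_op_state state;
	struct sketch_cookie *cookie; /* retained from allocation */
	void *context;                /* retained from allocation */
	int n_pages;
};

static void sketch_retrieval_alloc(struct sketch_retrieval *op,
				   struct sketch_cookie *cookie,
				   void *context, int n_pages)
{
	op->state = SK_ST_INITIALISED;
	op->cookie = cookie;
	op->context = context;
	op->n_pages = n_pages;
}

/* Release method: returns -1 where the kernel would hit an assertion. */
static int sketch_release_retrieval_op(struct sketch_retrieval *op)
{
	/* An op that never left INITIALISED may still count pages. */
	if (op->state != SK_ST_INITIALISED && op->n_pages != 0)
		return -1;
	if (op->context) {
		op->cookie->contexts_released++; /* stands in for the netfs callback */
		op->context = NULL;
	}
	return 0;
}

static int sketch_put_operation(struct sketch_retrieval *op)
{
	int ret = sketch_release_retrieval_op(op); /* release method first... */

	op->state = SK_ST_DEAD;                    /* ...then mark the op DEAD */
	return ret;
}

/* Early error path: the op is put before it was ever submitted. */
static int sketch_early_put(void)
{
	struct sketch_cookie cookie = { 0 };
	int netfs_context = 0;
	struct sketch_retrieval op;

	sketch_retrieval_alloc(&op, &cookie, &netfs_context, 3);
	if (sketch_put_operation(&op) != 0)
		return -1;
	return cookie.contexts_released;
}
```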
Any time an incomplete operation is cancelled, the operation cancellation
function needs to be called to clean up. This is currently being passed
directly to some of the functions that might want to call it, but not all.

Instead, pass the cancellation method pointer to fscache_operation_init()
and have that cache it in the operation struct. Further, plug in a dummy
cancellation handler if the caller declines to set one as this allows us to
call the function unconditionally (the extra overhead isn't worth bothering
about as we don't expect to be calling this typically).

The cancellation method must thence be called everywhere the CANCELLED state
is set. Note that we call it *before* setting the CANCELLED state such that
the method can use the old state value to guide its operation.

fscache_do_cancel_retrieval() needs moving higher up in the sources so that
the init function can use it now.

Without this, the following oops may be seen:
FS-Cache: Assertion failed
FS-Cache: 3 == 0 is false
------------[ cut here ]------------
kernel BUG at ../fs/fscache/page.c:261!
...
RIP: 0010:[] fscache_release_retrieval_op+0x77/0x100
[] fscache_put_operation+0x114/0x2da
[] __fscache_read_or_alloc_pages+0x358/0x3b3
[] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
[] nfs_readpages+0x10c/0x185 [nfs]
[] ? alloc_pages_current+0x119/0x13e
[] ? __page_cache_alloc+0xfb/0x10a
[] __do_page_cache_readahead+0x188/0x22c
[] ondemand_readahead+0x29e/0x2af
[] page_cache_sync_readahead+0x38/0x3a
[] generic_file_read_iter+0x1a2/0x55a
[] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
[] nfs_file_read+0x49/0x70 [nfs]
[] new_sync_read+0x78/0x9c
[] __vfs_read+0x13/0x38
[] vfs_read+0x95/0x121
[] SyS_read+0x4c/0x8a
[] system_call_fastpath+0x12/0x17

The assertion is showing that the remaining number of pages (n_pages) is not 0
when the operation is being released.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
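This is a sketch of the pattern the commit describes. The cancel method is cached at init time, a dummy is substituted when none is given, and the method is always called before the state becomes CANCELLED. The struct layout, names, and the `seen_by_cancel` demo field are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_op_state { SK_ST_INITIALISED, SK_ST_PENDING, SK_ST_IN_PROGRESS,
		       SK_ST_COMPLETE, SK_ST_CANCELLED, SK_ST_DEAD };

struct sketch_op;
typedef void (*sketch_cancel_fn)(struct sketch_op *op);

struct sketch_op {
	enum sketch_op_state state;
	sketch_cancel_fn cancel;             /* cached at init, never NULL */
	int n_pages;
	enum sketch_op_state seen_by_cancel; /* for the demo only */
};

static void sketch_dummy_cancel(struct sketch_op *op)
{
	(void)op; /* nothing to clean up */
}

static void sketch_operation_init(struct sketch_op *op, sketch_cancel_fn cancel)
{
	op->state = SK_ST_INITIALISED;
	op->cancel = cancel ? cancel : sketch_dummy_cancel;
	op->n_pages = 0;
	op->seen_by_cancel = SK_ST_DEAD;
}

/* Retrieval-style cancel: sees the *old* state and drops outstanding pages. */
static void sketch_cancel_retrieval(struct sketch_op *op)
{
	op->seen_by_cancel = op->state;
	op->n_pages = 0;
}

/* Every path that sets CANCELLED goes through here. */
static void sketch_set_cancelled(struct sketch_op *op)
{
	op->cancel(op);               /* called before the state changes */
	op->state = SK_ST_CANCELLED;
}

static bool sketch_cancel_pending_retrieval(void)
{
	struct sketch_op op;

	sketch_operation_init(&op, sketch_cancel_retrieval);
	op.state = SK_ST_PENDING;
	op.n_pages = 3;
	sketch_set_cancelled(&op);
	return op.state == SK_ST_CANCELLED && op.n_pages == 0 &&
	       op.seen_by_cancel == SK_ST_PENDING;
}

static bool sketch_cancel_without_method(void)
{
	struct sketch_op op;

	sketch_operation_init(&op, NULL);
	sketch_set_cancelled(&op);
	return op.state == SK_ST_CANCELLED;
}
```

The dummy handler lets every cancellation site call `op->cancel` unconditionally, which is the trade-off the commit accepts.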
Call fscache_put_operation() or a wrapper on any op that has gone through
fscache_operation_init() so that the accounting shown in /proc is done
correctly, specifically fscache_n_op_release.

fscache_put_operation() therefore now allows an op in the INITIALISED state as
well as in the CANCELLED and COMPLETE states.

Note that this means that an operation can get put that doesn't have its
->object pointer filled in, so anything that depends on the object needs to be
conditional in fscache_put_operation().

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
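Below is a hypothetical sketch of the widened precondition on the final put. A counter stands in for fscache_n_op_release. Object-dependent bookkeeping, modelled here as an `n_ops` decrement, runs only when an object is attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_op_state { SK_ST_INITIALISED, SK_ST_PENDING, SK_ST_IN_PROGRESS,
		       SK_ST_COMPLETE, SK_ST_CANCELLED, SK_ST_DEAD };

struct sketch_object { int n_ops; };

struct sketch_op {
	enum sketch_op_state state;
	int usage;
	struct sketch_object *object; /* NULL if the op was never submitted */
};

static unsigned int sketch_n_op_release; /* models fscache_n_op_release */

/* Returns false where the kernel would hit an assertion failure. */
static bool sketch_put_operation(struct sketch_op *op)
{
	if (--op->usage > 0)
		return true;
	if (op->state != SK_ST_INITIALISED &&
	    op->state != SK_ST_COMPLETE &&
	    op->state != SK_ST_CANCELLED)
		return false;
	op->state = SK_ST_DEAD;
	sketch_n_op_release++;
	if (op->object)          /* object-dependent work is now conditional */
		op->object->n_ops--;
	return true;
}

static bool sketch_put_unsubmitted(void)
{
	struct sketch_op op = { SK_ST_INITIALISED, 1, NULL };

	return sketch_put_operation(&op) && op.state == SK_ST_DEAD;
}

static bool sketch_put_pending(void)
{
	struct sketch_op op = { SK_ST_PENDING, 1, NULL };

	return sketch_put_operation(&op);
}
```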
Cancellation of an in-progress operation needs to update the relevant counters
and start any operations that are pending waiting on this one.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
Count and display through /proc/fs/fscache/stats the number of initialised
operations.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
Out of line fscache_operation_init() so that it can access internal FS-Cache
features, such as stats, in a later commit.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
Currently, fscache_cancel_op() only cancels pending operations - attempts to
cancel in-progress operations are ignored. This leads to a problem in
fscache_wait_for_operation_activation() whereby the wait is terminated, but
the object has been killed.

The check at the end of the function now triggers because it's no longer
contingent on the cache having produced an I/O error since the commit that
fixed the logic error in fscache_object_is_dead().

The result of the check is that it tries to cancel the operation - but since
the object may not be pending by this point, the cancellation request may be
ignored - with the result that the object is just put by the caller and
fscache_put_operation has an assertion failure because the operation isn't in
either the COMPLETE or the CANCELLED states.

To fix this, we permit in-progress ops to be cancelled under some
circumstances.

The bug results in an oops that looks something like this:
FS-Cache: fscache_wait_for_operation_activation() = -ENOBUFS [obj dead 3]
FS-Cache:
FS-Cache: Assertion failed
FS-Cache: 3 == 5 is false
------------[ cut here ]------------
kernel BUG at ../fs/fscache/operation.c:432!
...
RIP: 0010:[] fscache_put_operation+0xf2/0x2cd
Call Trace:
[] __fscache_read_or_alloc_pages+0x2ec/0x3b3
[] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
[] nfs_readpages+0x10c/0x185 [nfs]
[] ? alloc_pages_current+0x119/0x13e
[] ? __page_cache_alloc+0xfb/0x10a
[] __do_page_cache_readahead+0x188/0x22c
[] ondemand_readahead+0x29e/0x2af
[] page_cache_sync_readahead+0x38/0x3a
[] generic_file_read_iter+0x1a2/0x55a
[] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
[] nfs_file_read+0x49/0x70 [nfs]
[] new_sync_read+0x78/0x9c
[] __vfs_read+0x13/0x38
[] vfs_read+0x95/0x121
[] SyS_read+0x4c/0x8a
[] system_call_fastpath+0x12/0x17

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
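This is a hypothetical model of the failure and the fix. The commit permits in-progress cancellation only "under some circumstances" and does not spell those conditions out here. The sketch therefore reduces them to a single `allow_in_progress` flag, and `-EBUSY` as the "ignored" result is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum sketch_op_state { SK_ST_INITIALISED, SK_ST_PENDING, SK_ST_IN_PROGRESS,
		       SK_ST_COMPLETE, SK_ST_CANCELLED, SK_ST_DEAD };

struct sketch_object { int n_pending; int n_in_progress; };

struct sketch_op {
	enum sketch_op_state state;
	struct sketch_object *object;
};

/* Returns 0 if the op was cancelled, -EBUSY if the request was ignored. */
static int sketch_cancel_op(struct sketch_op *op, bool allow_in_progress)
{
	switch (op->state) {
	case SK_ST_PENDING:
		op->object->n_pending--;
		op->state = SK_ST_CANCELLED;
		return 0;
	case SK_ST_IN_PROGRESS:
		if (!allow_in_progress)
			return -EBUSY; /* old behaviour: silently ignored */
		op->object->n_in_progress--;
		op->state = SK_ST_CANCELLED;
		return 0;
	default:
		return -EBUSY;
	}
}

/* The put precondition at the time of this commit: COMPLETE or CANCELLED. */
static bool sketch_put_allowed(const struct sketch_op *op)
{
	return op->state == SK_ST_COMPLETE || op->state == SK_ST_CANCELLED;
}

/* Error path after the object has been killed: cancel, then put. */
static bool sketch_wait_error_path(bool allow_in_progress)
{
	struct sketch_object obj = { 0, 1 };
	struct sketch_op op = { SK_ST_IN_PROGRESS, &obj };

	(void)sketch_cancel_op(&op, allow_in_progress);
	return sketch_put_allowed(&op);
}
```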
fscache_object_is_dead() returns true only if the object is marked dead and
the cache got an I/O error. This should be a logical OR instead. Since two
of the callers got split up into handling for separate subcases, expand the
other callers and kill the function. This is probably the right thing to do
anyway since one of the subcases isn't about the object at all, but rather
about the cache.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton -
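The logic error is an AND that should have been an OR. In the sketch below, the object and cache types are hypothetical stand-ins. Note that the commit does not keep a fixed helper at all: it expands the callers and removes fscache_object_is_dead().

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_cache  { bool io_error; };
struct sketch_object { bool dead; const struct sketch_cache *cache; };

/* As described: true only if BOTH conditions hold (the logic error). */
static bool sketch_is_dead_and(const struct sketch_object *obj)
{
	return obj->dead && obj->cache->io_error;
}

/* What was intended: either condition makes the object unusable. */
static bool sketch_is_dead_or(const struct sketch_object *obj)
{
	return obj->dead || obj->cache->io_error;
}

/* Evaluate one predicate for a given combination of conditions. */
static bool sketch_eval(bool dead, bool io_error, bool use_or)
{
	struct sketch_cache cache = { io_error };
	struct sketch_object obj = { dead, &cache };

	return use_or ? sketch_is_dead_or(&obj) : sketch_is_dead_and(&obj);
}
```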
When an object is being marked as no longer live, do this under the object
spinlock to prevent a race with operation submission targeted on that object.The problem occurs due to the following pair of intertwined sequences when the
cache tries to create an object that would take it over the hard available
space limit:NETFS INTERFACE
===============
(A) The netfs calls fscache_acquire_cookie(). Object creation is deferred to
the object state machine and the netfs is allowed to continue.

OBJECT STATE MACHINE KTHREAD
============================
(1) The object is looked up on disk by fscache_look_up_object()
calling cachefiles_walk_to_object(). The latter finds that the
object is not yet represented on disk and calls
fscache_object_lookup_negative().

(2) fscache_object_lookup_negative() sets FSCACHE_COOKIE_NO_DATA_YET
and clears FSCACHE_COOKIE_LOOKING_UP, thus allowing the netfs to
start queuing read operations.

(B) The netfs calls fscache_read_or_alloc_pages(). This calls
fscache_wait_for_deferred_lookup(), which sees FSCACHE_COOKIE_LOOKING_UP
become clear, allowing the read to begin.

(C) A read operation is set up and passed to fscache_submit_op() to deal
with.

(3) cachefiles_walk_to_object() calls cachefiles_has_space(), which
fails (or one of the file operations to create stuff fails).
cachefiles returns an error to fscache.

(4) fscache_look_up_object() transits to the LOOKUP_FAILURE state.

(5) fscache_lookup_failure() sets FSCACHE_OBJECT_LOOKED_UP and
FSCACHE_COOKIE_UNAVAILABLE and clears FSCACHE_COOKIE_LOOKING_UP,
then transits to the KILL_OBJECT state.

(6) fscache_kill_object() clears FSCACHE_OBJECT_IS_LIVE in an attempt
to reject any further requests from the netfs.

(7) object->n_ops is examined and found to be 0.
fscache_kill_object() transits to the DROP_OBJECT state.

(D) fscache_submit_op() locks the object spinlock and checks whether it can
dispatch the op immediately by calling fscache_object_is_active(), which
fails since FSCACHE_OBJECT_IS_AVAILABLE has not yet been set.

(E) fscache_submit_op() then tests FSCACHE_OBJECT_LOOKED_UP, which is set.
It then queues the object and increments object->n_ops.

(8) fscache_drop_object() releases the object and eventually
fscache_put_object() calls cachefiles_put_object(), which suffers
an assertion failure here:

ASSERTCMP(object->fscache.n_ops, ==, 0);
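The ordering argument behind the fix can be checked with a small user-space model. This is a sketch under stated assumptions, not the fscache code: struct model_object, model_submit_op(), model_kill_object() and model_drop_object() are hypothetical names, a C11 atomic_flag spinlock stands in for the object spinlock, and the submit path is simplified to "queue only if still live and looked up". It shows that when the live flag is cleared and n_ops is sampled under the same lock the submit path holds, only two outcomes remain: the op is counted before the kill samples n_ops, or the op is rejected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the fields involved in the race; this is not
 * the real struct fscache_object. */
struct model_object {
	atomic_flag lock;  /* stands in for the object spinlock */
	bool is_live;      /* stands in for FSCACHE_OBJECT_IS_LIVE */
	bool looked_up;    /* stands in for FSCACHE_OBJECT_LOOKED_UP */
	int n_ops;         /* operations queued against the object */
};

static void model_lock(struct model_object *obj)
{
	while (atomic_flag_test_and_set_explicit(&obj->lock,
						 memory_order_acquire))
		; /* spin until the holder releases the flag */
}

static void model_unlock(struct model_object *obj)
{
	atomic_flag_clear_explicit(&obj->lock, memory_order_release);
}

/* Steps (D)/(E), simplified: the liveness test and the n_ops increment
 * happen within one lock hold. Returns true if the op was queued. */
static bool model_submit_op(struct model_object *obj)
{
	bool queued = false;

	model_lock(obj);
	if (obj->is_live && obj->looked_up) {
		obj->n_ops++;
		queued = true;
	}
	model_unlock(obj);
	return queued;
}

/* Steps (6)/(7): clear the live flag under the same lock and sample
 * n_ops. Returns true if the object may go straight to DROP_OBJECT. */
static bool model_kill_object(struct model_object *obj)
{
	int n_ops;

	model_lock(obj);
	obj->is_live = false;
	n_ops = obj->n_ops;
	model_unlock(obj);
	return n_ops == 0;
}

/* Step (8): the same invariant as ASSERTCMP(n_ops, ==, 0). */
static void model_drop_object(struct model_object *obj)
{
	assert(obj->n_ops == 0);
}
```

Submitting before the kill makes model_kill_object() return false, so the object is not dropped with an op outstanding; submitting after the kill makes model_submit_op() return false, so n_ops stays at 0 for the drop.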
Locking the object spinlock in step (6) around the clearance of
FSCACHE_OBJECT_IS_LIVE ensures that the decision trees in
fscache_submit_op() and fscache_submit_exclusive_op() don't see the IS_LIVE
flag being cleared mid-decision: either the op is queued before step (7), in
which case fscache_kill_object() will see n_ops > 0 and will deal with the op,
or the op will be rejected.

This, combined with rejecting op submission if the target object is dying,
fixes the problem.

The problem shows up as the following oops:
CacheFiles: Assertion failed
CacheFiles: 1 == 0 is false
------------[ cut here ]------------
kernel BUG at ../fs/cachefiles/interface.c:339!
...
RIP: 0010:[] [] cachefiles_put_object+0x2a4/0x301 [cachefiles]
...
Call Trace:
[] fscache_put_object+0x18/0x21 [fscache]
[] fscache_object_work_func+0x3ba/0x3c9 [fscache]
[] process_one_work+0x226/0x441
[] worker_thread+0x273/0x36b
[] ? rescuer_thread+0x2e1/0x2e1
[] kthread+0x10e/0x116
[] ? kthread_create_on_node+0x1bb/0x1bb
[] ret_from_fork+0x7c/0xb0
[] ? kthread_create_on_node+0x1bb/0x1bb

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton
-
Reject new operations that are being submitted against an object if that
object has failed its lookup or creation states or has been killed by the
cache backend for some other reason, such as having been culled.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton
-
When submitting an operation, prefer to cancel the operation immediately
rather than queuing it for later processing if the object is marked as dying
(i.e. the object state machine has reached the KILL_OBJECT state).

Whilst we're at it, change the series of related test_bit() calls into a
single READ_ONCE() of the flags followed by bitwise-AND tests, to reduce the
number of load instructions (test_bit() has a volatile address).

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton
-
Move fscache_report_unexpected_submission() up within operation.c so that it
can be called from fscache_submit_exclusive_op() too.

Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton
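Taken together, the submission-side commits above (rejecting ops against objects whose lookup or creation failed or that are dying, treating "dead" as a logical OR of conditions rather than an AND, and reading the flags word once with READ_ONCE() instead of through several test_bit() calls) can be modelled in a short user-space sketch. The flag bits, decide_submit() and its result values are hypothetical, a relaxed C11 atomic load stands in for READ_ONCE(), and the real fscache_submit_op() has more branches than this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's FSCACHE_OBJECT_* numbering differs. */
#define OBJ_IS_LIVE       (1UL << 0)
#define OBJ_IS_AVAILABLE  (1UL << 1)
#define OBJ_DYING         (1UL << 2) /* state machine reached KILL_OBJECT */
#define OBJ_FAILED        (1UL << 3) /* lookup/creation failed, or culled */

enum submit_result { OP_REJECTED, OP_DISPATCHED, OP_QUEUED };

static enum submit_result decide_submit(_Atomic unsigned long *flags,
					bool cache_io_error)
{
	/* One load of the whole flags word (the role READ_ONCE() plays),
	 * then plain bitwise tests on the local copy. */
	unsigned long f = atomic_load_explicit(flags, memory_order_relaxed);

	/* "Dead" is an OR of conditions: any single one is enough. */
	if (!(f & OBJ_IS_LIVE) || (f & (OBJ_DYING | OBJ_FAILED)) ||
	    cache_io_error)
		return OP_REJECTED;  /* cancel now rather than queue */

	if (f & OBJ_IS_AVAILABLE)
		return OP_DISPATCHED;

	return OP_QUEUED;            /* wait for the object to become available */
}
```

Because f is a local copy, every test sees one consistent view of the flags, and only one load of the shared word is needed.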
24 Feb, 2015
1 commit
-
Count the number of objects that get culled by the cache backend and the
number of objects that the cache backend declines to instantiate due to lack
of space in the cache.

These numbers are made available through /proc/fs/fscache/stats.
Signed-off-by: David Howells
Reviewed-by: Steve Dickson
Acked-by: Jeff Layton
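A minimal sketch of this kind of event counting, assuming plain C11 atomics in place of the kernel's atomic counters. The counter names, the note_*() helpers and the output format are hypothetical and do not reproduce the actual /proc/fs/fscache/stats layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical counters; the real fscache variable names may differ. */
static atomic_uint n_culled;   /* objects culled by the cache backend */
static atomic_uint n_no_space; /* objects not instantiated: no space */

/* Called from the backend's cull and out-of-space paths. Atomic
 * increments keep concurrent callers from losing updates. */
static void note_culled(void)   { atomic_fetch_add(&n_culled, 1); }
static void note_no_space(void) { atomic_fetch_add(&n_no_space, 1); }

/* Formats the counters for a stats file; returns what snprintf()
 * returns (the length of the full line). */
static int format_stats(char *buf, size_t len)
{
	return snprintf(buf, len, "culled=%u no_space=%u\n",
			atomic_load(&n_culled), atomic_load(&n_no_space));
}
```

The counters are only ever incremented on the hot paths; reading them for the stats file is a separate, cheap load of each value.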
14 Oct, 2014
1 commit
-
Reduce boilerplate code by using __seq_open_private() instead of seq_open()
in fscache_objlist_open().

Signed-off-by: Rob Jones
Signed-off-by: David Howells
Acked-by: Steve Dickson
18 Sep, 2014
1 commit
-
In rare cases under heavy VMA pressure, the refcount for an fscache cookie
becomes corrupt. In this case we decrement the refcount even if we fail before
incrementing it.

FS-Cache: Assertion failed bnode-eca5f9c6/syslog
0 > 0 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
Call Trace:
[] __fscache_relinquish_cookie+0x50/0x220 [fscache]
[] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
[] ceph_destroy_inode+0x33/0x200 [ceph]
[] ? __fsnotify_inode_delete+0xe/0x10
[] destroy_inode+0x3c/0x70
[] evict+0x111/0x180
[] iput+0x103/0x190
[] __dentry_kill+0x1c8/0x220
[] shrink_dentry_list+0xf1/0x250
[] prune_dcache_sb+0x4c/0x60
[] super_cache_scan+0xff/0x170
[] shrink_slab_node+0x140/0x2c0
[] shrink_slab+0x8a/0x130
[] balance_pgdat+0x3e2/0x5d0
[] kswapd+0x16a/0x4a0
[] ? __wake_up_sync+0x20/0x20
[] ? balance_pgdat+0x5d0/0x5d0
[] kthread+0xc9/0xe0
[] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
[] ? flush_kthread_worker+0xb0/0xb0
[] ret_from_fork+0x7c/0xb0
[] ? flush_kthread_worker+0xb0/0xb0
RIP [] __fscache_disable_cookie+0x1db/0x210 [fscache]
RSP
---[ end trace 254d0d7c74a01f25 ]---

Signed-off-by: Milosz Tanski
Signed-off-by: David Howells
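The bug class in this commit, and in the related 27 Aug 2014 commit about failing to drop the count when submission fails, is an unbalanced get/put on an error path. A hypothetical sketch of the balanced pattern follows: struct model_cookie, n_active and model_use_cookie() are illustrative names rather than the fscache code, and the two boolean parameters simulate a failure before and after the count is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cookie with a usage count. */
struct model_cookie {
	int n_active;
};

/* Returns 0 on success, -1 on failure. The count is dropped exactly
 * when it was taken, on every exit path. */
static int model_use_cookie(struct model_cookie *cookie,
			    bool fail_before_get, bool fail_after_get)
{
	int ret = 0;

	if (fail_before_get)
		return -1;          /* nothing taken yet: no put here */

	cookie->n_active++;         /* get */

	if (fail_after_get) {
		ret = -1;           /* e.g. submission failed */
		goto put;           /* taken, so it must still be dropped */
	}

	/* ... the operation would run here ... */

put:
	/* An underflow check in the spirit of the "0 > 0 is false"
	 * assertion in the trace above. */
	assert(cookie->n_active > 0);
	cookie->n_active--;         /* put */
	return ret;
}
```

Returning directly before the get and jumping to a single put label after it keeps the two failure modes described in these commits (an extra decrement, and a missing one) from creeping back in.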
27 Aug, 2014
2 commits
-
I've been seeing issues with disposing cookies under VMA pressure. The symptom
is that the refcount gets out of sync. In this case we fail to decrement the
refcount if submit fails. I found this while auditing the error handling in
and around cookie operations.

Signed-off-by: Milosz Tanski
Signed-off-by: David Howells
-
This is meant to avoid a recursive hang caused by the underlying filesystem
trying to grab a free page and causing a write-out.

INFO: task kworker/u30:7:28375 blocked for more than 120 seconds.
Not tainted 3.15.0-virtual #74
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u30:7 D 0000000000000000 0 28375 2 0x00000000
Workqueue: fscache_operation fscache_op_work_func [fscache]
ffff88000b147148 0000000000000046 0000000000000000 ffff88000b1471c8
ffff8807aa031820 0000000000014040 ffff88000b147fd8 0000000000014040
ffff880f0c50c860 ffff8807aa031820 ffff88000b147158 ffff88007be59cd0
Call Trace:
[] schedule+0x29/0x70
[] __fscache_wait_on_page_write+0x55/0x90 [fscache]
[] ? __wake_up_sync+0x20/0x20
[] __fscache_maybe_release_page+0x65/0x1e0 [fscache]
[] ceph_releasepage+0x83/0x100 [ceph]
[] ? anon_vma_fork+0x130/0x130
[] try_to_release_page+0x32/0x50
[] shrink_page_list+0x7e6/0x9d0
[] ? isolate_lru_pages.isra.73+0x78/0x1e0
[] shrink_inactive_list+0x252/0x4c0
[] shrink_lruvec+0x3e1/0x670
[] shrink_zone+0x3f/0x110
[] do_try_to_free_pages+0x1d6/0x450
[] ? zone_statistics+0x99/0xc0
[] try_to_free_pages+0xc4/0x180
[] __alloc_pages_nodemask+0x6b2/0xa60
[] ? __find_get_block+0xbe/0x250
[] ? wake_up_bit+0x2e/0x40
[] alloc_pages_current+0xb3/0x180
[] __page_cache_alloc+0xb7/0xd0
[] grab_cache_page_write_begin+0x7c/0xe0
[] ? ext4_mark_inode_dirty+0x82/0x220
[] ext4_da_write_begin+0x89/0x2d0
[] generic_perform_write+0xbe/0x1d0
[] ? update_time+0x81/0xc0
[] ? mnt_clone_write+0x12/0x30
[] __generic_file_aio_write+0x1ce/0x3f0
[] generic_file_aio_write+0x5e/0xe0
[] ext4_file_write+0x9f/0x410
[] ? ext4_file_open+0x66/0x180
[] do_sync_write+0x5a/0x90
[] cachefiles_write_page+0x149/0x430 [cachefiles]
[] ? radix_tree_gang_lookup_tag+0x89/0xd0
[] fscache_write_op+0x222/0x3b0 [fscache]
[] fscache_op_work_func+0x3a/0x100 [fscache]
[] process_one_work+0x179/0x4a0
[] worker_thread+0x11b/0x370
[] ? manage_workers.isra.21+0x2e0/0x2e0
[] kthread+0xc9/0xe0
[] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
[] ? flush_kthread_worker+0xb0/0xb0
[] ret_from_fork+0x7c/0xb0
[] ? flush_kthread_worker+0xb0/0xb0

Signed-off-by: Milosz Tanski
Signed-off-by: David Howells
07 Aug, 2014
1 commit
-
fscache_sysctls and fscache_sysctls_root are only used in main.c
Signed-off-by: Fabian Frederick
Cc: David Howells
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jul, 2014
1 commit
-
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:

Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.

Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.

All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
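The shape of this interface change can be illustrated with a toy, single-threaded user-space model. Everything here is hypothetical: struct model_env, fake_schedule() and the model_* functions only mimic the shape of the old and new kernel calls, "sleeping" just increments a counter, and another task clearing the bit is simulated once clear_after sleeps have happened.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: "sleeping" only counts, and the bit is cleared by
 * "another task" once clear_after sleeps have happened. */
struct model_env {
	unsigned long word;
	bool signal_pending;
	int sleeps;
	int clear_after;
};

enum model_mode { MODEL_UNINTERRUPTIBLE, MODEL_INTERRUPTIBLE };

static void fake_schedule(struct model_env *env, int bit)
{
	if (++env->sleeps >= env->clear_after)
		env->word &= ~(1UL << bit);
}

/* Old shape: every caller supplies an action that both sleeps and
 * decides whether a pending signal aborts the wait. */
typedef int (*model_action_t)(struct model_env *env, int bit);

static int model_interruptible_action(struct model_env *env, int bit)
{
	if (env->signal_pending)
		return 1;
	fake_schedule(env, bit);
	return 0;
}

static int model_wait_on_bit_action(struct model_env *env, int bit,
				    model_action_t action)
{
	while (env->word & (1UL << bit)) {
		int ret = action(env, bit);

		if (ret)
			return ret;
	}
	return 0;
}

/* New shape: the sleep is built in, and the mode argument alone
 * decides whether a pending signal ends the wait. */
static int model_wait_on_bit(struct model_env *env, int bit,
			     enum model_mode mode)
{
	while (env->word & (1UL << bit)) {
		if (mode == MODEL_INTERRUPTIBLE && env->signal_pending)
			return 1; /* non-zero, but no specific error code */
		fake_schedule(env, bit);
	}
	return 0;
}

/* The caller maps a non-zero return to an error code of its choosing;
 * -EINTR is an arbitrary pick for this model. */
static int model_caller(struct model_env *env, int bit)
{
	return model_wait_on_bit(env, bit, MODEL_INTERRUPTIBLE) ? -EINTR : 0;
}
```

With the mode-based shape, an interruptible wait and an uninterruptible one differ only in an argument, which removes the mismatch possible with actions (asking for one mode while passing an action written for the other).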
wait_on_bit{_lock} does not return any specific error code in the
event of a signal, so the caller must check for non-zero and
interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible".

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps' (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack will. So
the distinction will still be visible, only with different
function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since the first version of this patch (against 3.15), two new action
functions have appeared, one in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer-aware
schedule call as NFS.

Signed-off-by: NeilBrown
Acked-by: David Howells (fscache, keys)
Acked-by: Steven Whitehouse (gfs2)
Acked-by: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Steve French
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar