31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

30 Nov, 2018

1 commit

  • It was observed that a process blocked indefinitely in
    __fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
    to be cleared via fscache_wait_for_deferred_lookup().

    At this time, ->backing_objects was empty, which would normally prevent
    __fscache_read_or_alloc_page() from getting to the point of waiting.
    This implies that ->backing_objects was cleared *after*
    __fscache_read_or_alloc_page() was entered.

    When an object is "killed" and then "dropped",
    FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
    KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
    ->backing_objects cleared. This leaves a window where
    something else can set FSCACHE_COOKIE_LOOKING_UP and
    __fscache_read_or_alloc_page() can start waiting, before
    ->backing_objects is cleared.

    There is some uncertainty in this analysis, but it seems to fit the
    observations. The wake-up added by this patch is handled correctly by
    __fscache_read_or_alloc_page(), as it checks whether ->backing_objects is
    empty again after waiting.

    The customer who reported the hang also reports that it cannot be
    reproduced with this fix.
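
    The fix described above amounts to issuing a wake-up once the object has
    been unhooked from the cookie, so that any task which raced into the wait
    re-evaluates ->backing_objects. A minimal sketch of that pairing follows;
    the function name and exact placement are illustrative rather than the
    verbatim patch:

    /* Sketch only: wake anyone parked on FSCACHE_COOKIE_LOOKING_UP after the
     * object has been unhooked, so the waiter re-checks ->backing_objects
     * instead of sleeping forever.  The extra wake-up is harmless because
     * __fscache_read_or_alloc_page() repeats its hlist_empty() check. */
    static void example_detach_and_wake(struct fscache_cookie *cookie)
    {
            /* ... remove the object from cookie->backing_objects here ... */

            wake_up_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP);
    }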

    The backtrace for the blocked process looked like:

    PID: 29360 TASK: ffff881ff2ac0f80 CPU: 3 COMMAND: "zsh"
    #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
    #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
    #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
    #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
    #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
    #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
    #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
    #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
    #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
    #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
    #10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
    #11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
    #12 [ffff881ff43eff18] sys_read at ffffffff811fda62
    #13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e

    Signed-off-by: NeilBrown
    Signed-off-by: David Howells

    NeilBrown
     

25 Jul, 2018

1 commit

  • When a cookie is allocated that causes fscache_object structs to be
    allocated, those objects are initialised with the cookie pointer, but
    aren't blessed with a ref on that cookie unless the attachment is
    successfully completed in fscache_attach_object().

    If attachment fails because the parent object was dying or there was a
    collision, fscache_attach_object() returns without incrementing the cookie
    counter - but upon failure of this function, the object is released which
    then puts the cookie, whether or not a ref was taken on the cookie.

    Fix this by taking a ref on the cookie when it is assigned in
    fscache_object_init(), even when we're creating a root object.
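
    In sketch form, the change pins the cookie as soon as it is assigned to
    the object, so the unconditional put on the object-release path is always
    balanced. The field and helper usage below follow the fscache internals of
    that era but should be read as illustrative:

    static void example_object_init(struct fscache_object *object,
                                    struct fscache_cookie *cookie,
                                    struct fscache_cache *cache)
    {
            object->cache = cache;
            object->cookie = cookie;
            atomic_inc(&cookie->usage);     /* ref held even if attach later fails */
    }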

    Analysis from Kiran Kumar:

    This bug has been seen in 4.4.0-124-generic #148-Ubuntu kernel

    BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277

    fscache cookie ref count updated incorrectly during fscache object
    allocation resulting in following Oops.

    kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
    kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

    [Cause]
    Two threads are trying to operate on a cookie and two objects.

    (1) One thread tries to unmount the filesystem and in the process goes over
    a huge list of objects, marking them dead and deleting the objects.
    cookie->usage is also decremented in the following path:

    nfs_fscache_release_super_cookie
    -> __fscache_relinquish_cookie
    ->__fscache_cookie_put
    ->BUG_ON(atomic_read(&cookie->usage) <= 0);

    (2) A second thread allocates an object for the same cookie and tries to
    attach it, in the following path:

    fscache_alloc_object
    1) cachefiles_alloc_object
    -> fscache_object_init
    -> assign cookie, but usage not bumped.
    2) fscache_attach_object -> fails in cant_attach_object because the
    cookie's backing object or cookie's->parent object are going away
    3) fscache_put_object
    -> cachefiles_put_object
    ->fscache_object_destroy
    ->fscache_cookie_put
    ->BUG_ON(atomic_read(&cookie->usage) <= 0);

    Signed-off-by: David Howells

    Kiran Kumar Modukuri
     

12 Apr, 2018

1 commit

  • Don't open-code accesses to data structure internals.

    Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Apr, 2018

1 commit

  • Pass the object size in to fscache_acquire_cookie() and
    fscache_write_page() rather than the netfs providing a callback by which it
    can be received. This makes it easier to update the size of the object
    when a new page is written that extends the object.

    The current object size is also passed by fscache to the check_aux
    function, obviating the need to store it in the aux data.
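
    For reference, once this change and the index-key/aux-data change below
    are both in place, the acquisition prototype ends up looking roughly like
    the following (an approximation of that kernel era, not a stable API):

    struct fscache_cookie *fscache_acquire_cookie(struct fscache_cookie *parent,
                                                  const struct fscache_cookie_def *def,
                                                  const void *index_key, size_t index_key_len,
                                                  const void *aux_data, size_t aux_data_len,
                                                  void *netfs_data,
                                                  loff_t object_size,
                                                  bool enable);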

    Signed-off-by: David Howells
    Acked-by: Anna Schumaker
    Tested-by: Steve Dickson

    David Howells
     

04 Apr, 2018

3 commits

  • Attach copies of the index key and auxiliary data to the fscache cookie so
    that:

    (1) The callbacks to the netfs for this stuff can be eliminated. This
    can simplify things in the cache as the information is still
    available, even after the cache has relinquished the cookie.

    (2) Simplifies the locking requirements of accessing the information as we
    don't have to worry about the netfs object going away on us.

    (3) The cache can do lazy updating of the coherency information on disk.
    As long as the cache is flushed before reboot/poweroff, there's no
    need to update the coherency info on disk every time it changes.

    (4) Cookies can be hashed or put in a tree as the index key is easily
    available. This allows:

    (a) Checks for duplicate cookies can be made at the top fscache layer
    rather than down in the bowels of the cache backend.

    (b) Caching can be added to a netfs object that has a cookie if the
    cache is brought online after the netfs object is allocated.

    A certain amount of space is made in the cookie for inline copies of the
    data, but if it won't fit there, extra memory will be allocated for it.

    The downside of this is that live cache operation requires more memory.
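
    A sketch of the storage idea (field sizes are illustrative): small keys and
    auxiliary data live inline in the cookie, while anything larger spills into
    a separately allocated buffer reached through the union's pointer member.

    struct example_cookie_storage {
            u8      key_len;                /* length of index key */
            u8      aux_len;                /* length of auxiliary data */
            union {
                    void    *key;           /* heap copy when the key is large */
                    u8      inline_key[16]; /* small keys stored inline */
            };
            union {
                    void    *aux;           /* heap copy when the aux data is large */
                    u8      inline_aux[8];  /* small aux data stored inline */
            };
    };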

    Signed-off-by: David Howells
    Acked-by: Anna Schumaker
    Tested-by: Steve Dickson

    David Howells
     
  • Add more tracepoints to fscache, including:

    (*) fscache_page - Tracks netfs pages known to fscache.

    (*) fscache_check_page - Tracks the netfs querying whether a page is
    pending storage.

    (*) fscache_wake_cookie - Tracks cookies being woken up after a page
    completes/aborts storage in the cache.

    (*) fscache_op - Tracks operations being initialised.

    (*) fscache_wrote_page - Tracks return of the backend write_page op.

    (*) fscache_gang_lookup - Tracks lookup of pages to be stored in the write
    operation.

    Signed-off-by: David Howells

    David Howells
     
  • Add some tracepoints to fscache:

    (*) fscache_cookie - Tracks a cookie's usage count.

    (*) fscache_netfs - Logs registration of a network filesystem, including
    the pointer to the cookie allocated.

    (*) fscache_acquire - Logs cookie acquisition.

    (*) fscache_relinquish - Logs cookie relinquishment.

    (*) fscache_enable - Logs enablement of a cookie.

    (*) fscache_disable - Logs disablement of a cookie.

    (*) fscache_osm - Tracks execution of states in the object state machine.

    and cachefiles:

    (*) cachefiles_ref - Tracks a cachefiles object's usage count.

    (*) cachefiles_lookup - Logs result of lookup_one_len().

    (*) cachefiles_mkdir - Logs result of vfs_mkdir().

    (*) cachefiles_create - Logs result of vfs_create().

    (*) cachefiles_unlink - Logs calls to vfs_unlink().

    (*) cachefiles_rename - Logs calls to vfs_rename().

    (*) cachefiles_mark_active - Logs an object becoming active.

    (*) cachefiles_wait_active - Logs a wait for an old object to be
    destroyed.

    (*) cachefiles_mark_inactive - Logs an object becoming inactive.

    (*) cachefiles_mark_buried - Logs the burial of an object.

    Signed-off-by: David Howells

    David Howells
     

01 Feb, 2017

2 commits

  • Under some circumstances, an fscache object can become queued in such a way
    that fscache_object_work_func() is called once the object is in the
    OBJECT_DEAD state. This results in the kernel oopsing when it tries to
    invoke the handler for the state (which is hard-coded to 0x2).

    The way this comes about is something like the following:

    (1) The object dispatcher is processing a work state for an object. This
    is done in workqueue context.

    (2) An out-of-band event comes in that isn't masked, causing the object to
    be queued, say EV_KILL.

    (3) The object dispatcher finishes processing the current work state on
    that object and then sees there's another event to process, so,
    without returning to the workqueue core, it processes that event too.
    It then follows the chain of events that this initiates until it reaches
    OBJECT_DEAD without going through a wait state (such as
    WAIT_FOR_CLEARANCE).

    At this point, object->events may be 0, object->event_mask will be 0
    and oob_event_mask will be 0.

    (4) The object dispatcher returns to the workqueue processor, and in due
    course, this sees that the object's work item is still queued and
    invokes it again.

    (5) The current state is a work state (OBJECT_DEAD), so the dispatcher
    jumps to it - resulting in an OOPS.

    When I'm seeing this, the work state in (1) appears to have been either
    LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
    fscache_osm_lookup_oob).

    The window for (2) is very small:

    (A) object->event_mask is cleared whilst the event dispatch process is
    underway - though there's no memory barrier to force this to the top
    of the function.

    The window, therefore, is from the time the object was selected by the
    workqueue processor and made requeueable to the time the mask was
    cleared.

    (B) fscache_raise_event() will only queue the object if it manages to set
    the event bit and the corresponding event_mask bit was set.

    The enqueuement is then deferred slightly whilst we get a ref on the
    object and get the per-CPU variable for workqueue congestion. This
    slight deferral slightly increases the probability by allowing extra
    time for the workqueue to make the item requeueable.

    Handle this by giving the dead state a processor function and checking for
    the dead state address rather than seeing if the processor function is
    address 0x2. The dead state processor function can then set a flag to
    indicate that it has occurred and give a warning if it occurs more than
    once per object.
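
    A sketch of the shape of such a handler (names approximate the real code):
    OBJECT_DEAD gets a genuine processor function that records the post-mortem
    dispatch and warns if it happens repeatedly, instead of the dispatcher
    jumping through the 0x2 marker.

    static const struct fscache_state *example_object_dead(struct fscache_object *object,
                                                           int event)
    {
            if (!test_and_set_bit(FSCACHE_OBJECT_RUN_AFTER_DEAD, &object->flags))
                    return NO_TRANSIT;      /* first time here: just note it */

            WARN(true, "FS-Cache object redispatched after death");
            return NO_TRANSIT;
    }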

    If this race occurs, an oops similar to the following is seen (note the RIP
    value):

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
    IP: [] 0x1
    PGD 0
    Oops: 0010 [#1] SMP
    Modules linked in: ...
    CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
    Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
    Workqueue: fscache_object fscache_object_work_func [fscache]
    task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
    RIP: 0010:[] [] 0x1
    RSP: 0018:ffff880717547df8 EFLAGS: 00010202
    RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
    RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
    RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
    R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
    R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
    FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
    ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
    ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
    Call Trace:
    [] ? fscache_object_work_func+0xa5/0x200 [fscache]
    [] process_one_work+0x17b/0x470
    [] worker_thread+0x21c/0x400
    [] ? rescuer_thread+0x400/0x400
    [] kthread+0xcf/0xe0
    [] ? kthread_create_on_node+0x140/0x140
    [] ret_from_fork+0x58/0x90
    [] ? kthread_create_on_node+0x140/0x140

    Signed-off-by: David Howells
    Acked-by: Jeremy McNicoll
    Tested-by: Frank Sorenson
    Tested-by: Benjamin Coddington
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Al Viro

    David Howells
     
  • fscache_disable_cookie() needs to clear the outstanding writes on the
    cookie it's disabling because they cannot be completed afterwards.

    Without this, fscache_nfs_open_file() gets stuck because it disables the
    cookie when the file is opened for writing but can't uncache the pages till
    afterwards - otherwise there's a race between the open routine and anyone
    who already has it open R/O and is still reading from it.
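
    In sketch form, the change amounts to discarding queued writes while the
    cookie is being disabled, so later waits on page-write completion cannot
    block on work that will never run (the helper call shown is an assumption
    about where the existing store-cancelling code gets reused):

    static void example_disable_cookie(struct fscache_cookie *cookie)
    {
            /* assumption: reuse the existing helper that cancels queued stores */
            fscache_invalidate_writes(cookie);
    }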

    Looking in /proc/pid/stack of the offending process shows:

    [] __fscache_wait_on_page_write+0x82/0x9b [fscache]
    [] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache]
    [] nfs_fscache_open_file+0x59/0x9e [nfs]
    [] nfs4_file_open+0x17f/0x1b8 [nfsv4]
    [] do_dentry_open+0x16d/0x2b7
    [] vfs_open+0x5c/0x65
    [] path_openat+0x785/0x8fb
    [] do_filp_open+0x48/0x9e
    [] do_sys_open+0x13b/0x1cb
    [] SyS_open+0x19/0x1b
    [] do_syscall_64+0x80/0x17a
    [] return_from_SYSCALL_64+0x0/0x7a
    [] 0xffffffffffffffff

    Reported-by: Jianhong Yin
    Signed-off-by: David Howells
    Acked-by: Jeff Layton
    Acked-by: Steve Dickson
    Signed-off-by: Al Viro

    David Howells
     

02 Apr, 2015

3 commits

  • Any time an incomplete operation is cancelled, the operation cancellation
    function needs to be called to clean up. This is currently being passed
    directly to some of the functions that might want to call it, but not all.

    Instead, pass the cancellation method pointer to the fscache_operation_init()
    and have that cache it in the operation struct. Further, plug in a dummy
    cancellation handler if the caller declines to set one as this allows us to
    call the function unconditionally (the extra overhead isn't worth bothering
    about as we don't expect to be calling this typically).

    The cancellation method must thence be called everywhere the CANCELLED state
    is set. Note that we call it *before* setting the CANCELLED state such that
    the method can use the old state value to guide its operation.

    fscache_do_cancel_retrieval() needs moving higher up in the sources so that
    the init function can use it now.
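
    A sketch of the init-time caching of the cancel method (approximating the
    interface of that era; treat the names as illustrative): a NULL cancel
    argument is replaced with a dummy handler so that callers may invoke
    op->cancel() unconditionally.

    static void example_operation_dummy_cancel(struct fscache_operation *op)
    {
    }

    static void example_operation_init(struct fscache_operation *op,
                                       fscache_operation_processor_t processor,
                                       fscache_operation_cancel_t cancel,
                                       fscache_operation_release_t release)
    {
            op->processor = processor;
            op->cancel    = cancel ?: example_operation_dummy_cancel;
            op->release   = release;
    }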

    Without this, the following oops may be seen:

    FS-Cache: Assertion failed
    FS-Cache: 3 == 0 is false
    ------------[ cut here ]------------
    kernel BUG at ../fs/fscache/page.c:261!
    ...
    RIP: 0010:[] fscache_release_retrieval_op+0x77/0x100
    [] fscache_put_operation+0x114/0x2da
    [] __fscache_read_or_alloc_pages+0x358/0x3b3
    [] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
    [] nfs_readpages+0x10c/0x185 [nfs]
    [] ? alloc_pages_current+0x119/0x13e
    [] ? __page_cache_alloc+0xfb/0x10a
    [] __do_page_cache_readahead+0x188/0x22c
    [] ondemand_readahead+0x29e/0x2af
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_read_iter+0x1a2/0x55a
    [] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
    [] nfs_file_read+0x49/0x70 [nfs]
    [] new_sync_read+0x78/0x9c
    [] __vfs_read+0x13/0x38
    [] vfs_read+0x95/0x121
    [] SyS_read+0x4c/0x8a
    [] system_call_fastpath+0x12/0x17

    The assertion is showing that the remaining number of pages (n_pages) is not 0
    when the operation is being released.

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     
  • When an object is being marked as no longer live, do this under the object
    spinlock to prevent a race with operation submission targeted on that object.

    The problem occurs due to the following pair of intertwined sequences when the
    cache tries to create an object that would take it over the hard available
    space limit:

    NETFS INTERFACE
    ===============
    (A) The netfs calls fscache_acquire_cookie(). Object creation is deferred to
    the object state machine and the netfs is allowed to continue.

    OBJECT STATE MACHINE KTHREAD
    ============================
    (1) The object is looked up on disk by fscache_look_up_object()
    calling cachefiles_walk_to_object(). The latter finds that the
    object is not yet represented on disk and calls
    fscache_object_lookup_negative().

    (2) fscache_object_lookup_negative() sets FSCACHE_COOKIE_NO_DATA_YET
    and clears FSCACHE_COOKIE_LOOKING_UP, thus allowing the netfs to
    start queuing read operations.

    (B) The netfs calls fscache_read_or_alloc_pages(). This calls
    fscache_wait_for_deferred_lookup() which sees FSCACHE_COOKIE_LOOKING_UP
    become clear, allowing the read to begin.

    (C) A read operation is set up and passed to fscache_submit_op() to deal
    with.

    (3) cachefiles_walk_to_object() calls cachefiles_has_space(), which
    fails (or one of the file operations to create stuff fails).
    cachefiles returns an error to fscache.

    (4) fscache_look_up_object() transits to the LOOKUP_FAILURE state,

    (5) fscache_lookup_failure() sets FSCACHE_OBJECT_LOOKED_UP and
    FSCACHE_COOKIE_UNAVAILABLE and clears FSCACHE_COOKIE_LOOKING_UP
    then transits to the KILL_OBJECT state.

    (6) fscache_kill_object() clears FSCACHE_OBJECT_IS_LIVE in an attempt
    to reject any further requests from the netfs.

    (7) object->n_ops is examined and found to be 0.
    fscache_kill_object() transits to the DROP_OBJECT state.

    (D) fscache_submit_op() locks the object spinlock, sees if it can dispatch
    the op immediately by calling fscache_object_is_active() - which fails
    since FSCACHE_OBJECT_IS_AVAILABLE has not yet been set.

    (E) fscache_submit_op() then tests FSCACHE_OBJECT_LOOKED_UP - which is set.
    It then queues the object and increments object->n_ops.

    (8) fscache_drop_object() releases the object and eventually
    fscache_put_object() calls cachefiles_put_object() which suffers
    an assertion failure here:

    ASSERTCMP(object->fscache.n_ops, ==, 0);

    Locking the object spinlock in step (6) around the clearance of
    FSCACHE_OBJECT_IS_LIVE ensures that the decision trees in
    fscache_submit_op() and fscache_submit_exclusive_op() don't see the IS_LIVE
    flag being cleared mid-decision: either the op is queued before step (7) - in
    which case fscache_kill_object() will see n_ops>0 and will deal with the op -
    or the op will be rejected.

    This, combined with rejecting op submission if the target object is dying,
    fixes the problem.
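
    In sketch form, step (6) now looks like the following: the IS_LIVE flag is
    cleared with object->lock held, so fscache_submit_op() either sees the flag
    while the op can still be accounted in ->n_ops, or sees it already cleared
    and rejects the op (names are illustrative, not the verbatim patch).

    static void example_kill_object(struct fscache_object *object)
    {
            spin_lock(&object->lock);
            clear_bit(FSCACHE_OBJECT_IS_LIVE, &object->flags);
            /* ... cancel or requeue pending operations as before ... */
            spin_unlock(&object->lock);
    }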

    The problem shows up as the following oops:

    CacheFiles: Assertion failed
    CacheFiles: 1 == 0 is false
    ------------[ cut here ]------------
    kernel BUG at ../fs/cachefiles/interface.c:339!
    ...
    RIP: 0010:[] [] cachefiles_put_object+0x2a4/0x301 [cachefiles]
    ...
    Call Trace:
    [] fscache_put_object+0x18/0x21 [fscache]
    [] fscache_object_work_func+0x3ba/0x3c9 [fscache]
    [] process_one_work+0x226/0x441
    [] worker_thread+0x273/0x36b
    [] ? rescuer_thread+0x2e1/0x2e1
    [] kthread+0x10e/0x116
    [] ? kthread_create_on_node+0x1bb/0x1bb
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_node+0x1bb/0x1bb

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     
  • Reject new operations that are being submitted against an object if that
    object has failed its lookup or creation states or has been killed by the
    cache backend for some other reason, such as having been culled.

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     

24 Feb, 2015

1 commit


27 Aug, 2014

1 commit

  • I've been seeing issues with disposing of cookies under vma pressure. The
    symptom is that the refcount gets out of sync: we fail to decrement the
    refcount if submit fails. I found this while auditing the error handling
    in and around cookie operations.

    Signed-off-by: Milosz Tanski
    Signed-off-by: David Howells

    Milosz Tanski
     

18 Feb, 2014

1 commit

  • When FS-Cache allocates an object, the following sequence of events can
    occur:

    -->fscache_alloc_object()
    -->cachefiles_alloc_object() [via cache->ops->alloc_object]
    fscache_attach_object()
    cachefiles_put_object() [via cache->ops->put_object]
    -->fscache_object_destroy()
    -->fscache_objlist_remove()
    -->rb_erase() to remove the object from fscache_object_list.

    resulting in a crash in the rbtree code.

    The problem is that the object is only added to fscache_object_list on
    the success path of fscache_attach_object() where it calls
    fscache_objlist_add().

    So if fscache_attach_object() fails, the object won't have been added to
    the objlist rbtree. We do, however, unconditionally try to remove the
    object from the tree.

    Thanks to NeilBrown for finding this and suggesting this solution.
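
    One way to express the guard, assuming the rb node is cleared at
    initialisation so that an object which was never added can be recognised
    (a sketch, not necessarily the exact patch):

    static void example_objlist_remove(struct fscache_object *obj)
    {
            if (RB_EMPTY_NODE(&obj->objlist_link))
                    return;         /* attach failed; the object was never added */

            write_lock(&fscache_object_list_lock);
            rb_erase(&obj->objlist_link, &fscache_object_list);
            write_unlock(&fscache_object_list_lock);
    }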

    Reported-by: NeilBrown
    Signed-off-by: David Howells
    Tested-by: (a customer of) NeilBrown
    Signed-off-by: Linus Torvalds

    David Howells
     

14 Nov, 2013

1 commit

  • Pull block IO core updates from Jens Axboe:
    "This is the pull request for the core changes in the block layer for
    3.13. It contains:

    - The new blk-mq request interface.

    This is a new and more scalable queueing model that marries the
    best part of the request based interface we currently have (which
    is fully featured, but scales poorly) and the bio based "interface"
    which the new drivers for high IOPS devices end up using because
    it's much faster than the request based one.

    The bio interface has no block layer support, since it taps into
    the stack much earlier. This means that drivers end up having to
    implement a lot of functionality on their own, like tagging,
    timeout handling, requeue, etc. The blk-mq interface provides all
    these. Some drivers even provide a switch to select bio or rq and
    has code to handle both, since things like merging only works in
    the rq model and hence is faster for some workloads. This is a
    huge mess. Conversion of these drivers nets us a substantial code
    reduction. Initial results on converting SCSI to this model even
    shows an 8x improvement on single queue devices. So while the
    model was intended to work on the newer multiqueue devices, it has
    substantial improvements for "classic" hardware as well. This code
    has gone through extensive testing and development, it's now ready
    to go. A pull request to convert virtio-blk to this model will be
    coming as well, with more drivers scheduled
    for 3.14 conversion.

    - Two blktrace fixes from Jan and Chen Gang.

    - A plug merge fix from Alireza Haghdoost.

    - Conversion of __get_cpu_var() from Christoph Lameter.

    - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

    - A fix for a race between request completion and the timeout
    handling from Jeff Moyer. This is what caused the merge conflict
    with blk-mq/core, in case you are looking at that.

    - A dm stacking fix from Mike Snitzer.

    - A code consolidation fix and duplicated code removal from Kent
    Overstreet.

    - A handful of block bug fixes from Mikulas Patocka, fixing a loop
    crash and memory corruption on blk cg.

    - Elevator switch bug fix from Tomoki Sekiyama.

    A heads-up that I had to rebase this branch. Initially the immutable
    bio_vecs had been queued up for inclusion, but a week later, it became
    clear that it wasn't fully cooked yet. So the decision was made to
    pull this out and postpone it until 3.14. It was a straight forward
    rebase, just pruning out the immutable series and the later fixes of
    problems with it. The rest of the patches applied directly and no
    further changes were made"

    * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: Do not call sector_div() with a 64-bit divisor
    kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
    block: Consolidate duplicated bio_trim() implementations
    block: Use rw_copy_check_uvector()
    block: Enable sysfs nomerge control for I/O requests in the plug list
    block: properly stack underlying max_segment_size to DM device
    elevator: acquire q->sysfs_lock in elevator_change()
    elevator: Fix a race in elevator switching and md device initialization
    block: Replace __get_cpu_var uses
    bdi: test bdi_init failure
    block: fix a probe argument to blk_register_region
    loop: fix crash if blk_alloc_queue fails
    blk-core: Fix memory corruption if blkcg_init_queue fails
    block: fix race between request completion and timeout handling
    blktrace: Send BLK_TN_PROCESS events to all running traces
    blk-mq: don't disallow request merges for req->special being set
    blk-mq: mq plug list breakage
    blk-mq: fix for flush deadlock
    ...

    Linus Torvalds
     

08 Nov, 2013

1 commit

  • __get_cpu_var() is used for multiple purposes in the kernel source. One of
    them is address calculation via the form &__get_cpu_var(x). This calculates
    the address for the instance of the percpu variable of the current processor
    based on an offset.

    Other use cases are for storing and retrieving data from the current
    processor's percpu area. __get_cpu_var() can be used as an lvalue when
    writing data or on the right side of an assignment.

    __get_cpu_var() is defined as:

    #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

    __get_cpu_var() always only does an address determination. However, store
    and retrieve operations could use a segment prefix (or global register on
    other platforms) to avoid the address calculation.

    this_cpu_write() and this_cpu_read() can directly take an offset into a
    percpu area and use optimized assembly code to read and write per cpu
    variables.

    This patch converts __get_cpu_var into either an explicit address
    calculation using this_cpu_ptr() or into a use of this_cpu operations that
    use the offset. Thereby address calculations are avoided and fewer
    registers are used when code is generated.

    At the end of the patch set all uses of __get_cpu_var have been removed so
    the macro is removed too.

    The patch set includes passes over all arches as well. Once these operations
    are used throughout, specialized macros can be defined in non-x86
    arches as well in order to optimize per cpu access by, for example, using a
    global register that may be set to the per cpu base.

    Transformations done to __get_cpu_var()

    1. Determine the address of the percpu instance of the current processor.

    DEFINE_PER_CPU(int, y);
    int *x = &__get_cpu_var(y);

    Converts to

    int *x = this_cpu_ptr(&y);

    2. Same as #1 but this time an array structure is involved.

    DEFINE_PER_CPU(int, y[20]);
    int *x = __get_cpu_var(y);

    Converts to

    int *x = this_cpu_ptr(y);

    3. Retrieve the content of the current processors instance of a per cpu
    variable.

    DEFINE_PER_CPU(int, y);
    int x = __get_cpu_var(y)

    Converts to

    int x = __this_cpu_read(y);

    4. Retrieve the content of a percpu struct

    DEFINE_PER_CPU(struct mystruct, y);
    struct mystruct x = __get_cpu_var(y);

    Converts to

    memcpy(&x, this_cpu_ptr(&y), sizeof(x));

    5. Assignment to a per cpu variable

    DEFINE_PER_CPU(int, y)
    __get_cpu_var(y) = x;

    Converts to

    this_cpu_write(y, x);

    6. Increment/Decrement etc of a per cpu variable

    DEFINE_PER_CPU(int, y);
    __get_cpu_var(y)++

    Converts to

    this_cpu_inc(y)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jens Axboe

    Christoph Lameter
     

28 Sep, 2013

1 commit

  • Provide the ability to enable and disable fscache cookies. A disabled cookie
    will reject or ignore further requests to:

    Acquire a child cookie
    Invalidate and update backing objects
    Check the consistency of a backing object
    Allocate storage for backing page
    Read backing pages
    Write to backing pages

    but still allows:

    Checks/waits on the completion of already in-progress objects
    Uncaching of pages
    Relinquishment of cookies

    Two new operations are provided:

    (1) Disable a cookie:

    void fscache_disable_cookie(struct fscache_cookie *cookie,
    bool invalidate);

    If the cookie is not already disabled, this locks the cookie against other
    dis/enablement ops, marks the cookie as being disabled, discards or
    invalidates any backing objects and waits for cessation of activity on any
    associated object.

    This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
    but it reinitialises the cookie such that it can be reenabled.

    All possible failures are handled internally. The caller should consider
    calling fscache_uncache_all_inode_pages() afterwards to make sure all page
    markings are cleared up.

    (2) Enable a cookie:

    void fscache_enable_cookie(struct fscache_cookie *cookie,
    bool (*can_enable)(void *data),
    void *data)

    If the cookie is not already enabled, this locks the cookie against other
    dis/enablement ops, invokes can_enable() and, if the cookie is not an
    index cookie, will begin the procedure of acquiring backing objects.

    The optional can_enable() function is passed the data argument and returns
    a ruling as to whether or not enablement should actually be permitted to
    begin.

    All possible failures are handled internally. The cookie will only be
    marked as enabled if provisional backing objects are allocated.
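
    An illustrative netfs-side use of the prototype quoted above; the
    can_enable() callback here is hypothetical and simply refuses enablement
    while the inode has writers:

    static bool example_can_enable(void *data)
    {
            struct inode *inode = data;

            return atomic_read(&inode->i_writecount) <= 0;
    }

    static void example_reenable(struct fscache_cookie *cookie, struct inode *inode)
    {
            fscache_enable_cookie(cookie, example_can_enable, inode);
    }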

    A later patch will introduce these to NFS. Cookie enablement during nfs_open()
    is then contingent on i_writecount <= 0.

    Signed-off-by: David Howells

    David Howells
     

19 Jun, 2013

4 commits

  • Simplify the way fscache cache objects retain their cookie. The way I
    implemented the cookie storage handling made synchronisation a pain (ie. the
    object state machine can't rely on the cookie actually still being there).

    Instead of the object being detached from the cookie and the cookie being
    freed in __fscache_relinquish_cookie(), we defer both operations:

    (*) The detachment of the object from the list in the cookie now takes place
    in fscache_drop_object() and is thus governed by the object state machine
    (fscache_detach_from_cookie() has been removed).

    (*) The release of the cookie is now in fscache_object_destroy() - which is
    called by the cache backend just before it frees the object.

    This means that the fscache_cookie struct is now available to the cache all the
    way through from ->alloc_object() to ->drop_object() and ->put_object() -
    meaning that it's no longer necessary to take object->lock to guarantee access.

    However, __fscache_relinquish_cookie() doesn't wait for the object to go all
    the way through to destruction before letting the netfs proceed. That would
    massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the
    cookie around, it must therefore break all attachments to the netfs - which
    includes ->def, ->netfs_data and any outstanding page read/writes.

    To handle this, struct fscache_cookie now has an n_active counter:

    (1) This starts off initialised to 1.

    (2) Any time the cache needs to get at the netfs data, it calls
    fscache_use_cookie() to increment it - if it is not zero. If it was zero,
    then access is not permitted.

    (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
    to decrement it. This does a wake-up on it if it reaches 0.

    (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
    reach 0. The initialisation to 1 in step (1) ensures that we only get
    wake ups when we're trying to get rid of the cookie.

    This leaves __fscache_relinquish_cookie() a lot simpler.
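
    A sketch of the n_active discipline described above (close to, though not
    necessarily identical with, the helpers of that era; wake_up_atomic_t() is
    the era-appropriate wake primitive):

    static inline bool example_use_cookie(struct fscache_cookie *cookie)
    {
            /* refuse access once relinquishment has dropped n_active to 0 */
            return atomic_inc_not_zero(&cookie->n_active) != 0;
    }

    static inline void example_unuse_cookie(struct fscache_cookie *cookie)
    {
            if (atomic_dec_and_test(&cookie->n_active))
                    wake_up_atomic_t(&cookie->n_active);    /* let the relinquisher proceed */
    }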

    ***
    This fixes a problem in the current code whereby if fscache_invalidate() is
    followed sufficiently quickly by fscache_relinquish_cookie() then it is
    possible for __fscache_relinquish_cookie() to have detached the cookie from the
    object and cleared the pointer before a thread is dispatched to process the
    invalidation state in the object state machine.

    Since the pending write clearance was deferred to the invalidation state to
    make it asynchronous, we need to either wait in relinquishment for the stores
    tree to be cleared in the invalidation state or we need to handle the clearance
    in relinquishment.

    Further, if the relinquishment code does clear the tree, then the invalidation
    state needs to make the clearance contingent on still having the cookie to hand
    (since that's where the tree is rooted) and we have to prevent the cookie from
    disappearing for the duration.

    This can lead to an oops like the following:

    BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
    ...
    RIP: 0010:[] _spin_lock+0xe/0x30
    ...
    CR2: 000000000000000c ...
    ...
    Process kslowd002 (...)
    ....
    Call Trace:
    [] fscache_invalidate_writes+0x38/0xd0 [fscache]
    [] ? __switch_to+0xd0/0x320
    [] ? find_busiest_queue+0x69/0x150
    [] ? slow_work_enqueue+0x104/0x180
    [] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
    [] ? bit_waitqueue+0x17/0xd0
    [] slow_work_execute+0x233/0x310
    [] slow_work_thread+0x205/0x360
    [] ? autoremove_wake_function+0x0/0x40
    [] ? slow_work_thread+0x0/0x360
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    The parameter to fscache_invalidate_writes() was object->cookie which is NULL.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Fix object state machine to have separate work and wait states as that makes
    it easier to envision.

    There are now three kinds of state:

    (1) Work state. This is an execution state. No event processing is performed
    by a work state. The function attached to a work state returns a pointer
    indicating the next state to which the OSM should transition. Returning
    NO_TRANSIT repeats the current state, but goes back to the scheduler
    first.

    (2) Wait state. This is an event processing state. No execution is
    performed by a wait state. Wait states are just tables of "if event X
    occurs, clear it and transition to state Y". The dispatcher returns to
    the scheduler if none of the events in which the wait state has an
    interest are currently pending.

    (3) Out-of-band state. This is a special work state. Transitions to normal
    states can be overridden when an unexpected event occurs (eg. I/O error).
    Instead the dispatcher disables and clears the OOB event and transits to
    the specified work state. This then acts as an ordinary work state,
    though object->state points to the overridden destination. Returning
    NO_TRANSIT resumes the overridden transition.
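
    A simplified sketch of the declaration style this introduces (approximating,
    not quoting, fs/fscache/object.c, where WORK_STATE/WAIT_STATE/TRANSIT_TO are
    macros that fill in a small state descriptor struct): a work state names its
    work function, while a wait state carries only an event-to-state transition
    table.

    static WORK_STATE(LOOK_UP_OBJECT, "LOOK", fscache_look_up_object);

    static WAIT_STATE(WAIT_FOR_CMD, "?CMD",
                      TRANSIT_TO(INVALIDATE_OBJECT, 1 << FSCACHE_OBJECT_EV_INVALIDATE),
                      TRANSIT_TO(UPDATE_OBJECT,     1 << FSCACHE_OBJECT_EV_UPDATE));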

    In addition, the states have names in their definitions, so there's no need for
    tables of state names. Further, the EV_REQUEUE event is no longer necessary as
    that is automatic for work states.

    Since the states are now separate structs rather than values in an enum, it's
    not possible to use comparisons other than (non-)equality between them, so use
    some object->flags to indicate what phase an object is in.

    The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
    (EV_KILL). An object flag now carries the information about retirement.

    Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
    into a KILL_OBJECT state and additional states have been added for handling
    waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).

    A state has also been added for synchronising with parent object initialisation
    (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Wrap checks on object state (mostly outside of fs/fscache/object.c) with
    inline functions so that the mechanism can be replaced.

    Some of the state checks within object.c are left as-is as they will be
    replaced.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Uninline fscache_object_init() so as not to expose some of the FS-Cache
    internals to the cache backend.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     

21 Dec, 2012

5 commits

  • Add a missing transition to the FS-Cache object state machine to handle an
    invalidation event occurring between the back end completing the object lookup
    by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE)
    and the backend returning to fscache_lookup_object() and thence to
    fscache_object_state_machine() which then does a goto lookup_transit to handle
    the transition - but lookup_transit doesn't handle EV_INVALIDATE.

    Without this, the following BUG can be logged:

    FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/object.c:357!

    Where event 2 is EV_INVALIDATE.

    Signed-off-by: David Howells

    David Howells
     
  • The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG
    if there's been an I/O error because it may see the parent cache object in an
    unexpected state. It should only BUG if there hasn't been an I/O error.

    In this case the problem was produced by remounting the cache partition to be
    R/O. The EROFS state was detected and the cache was aborted, but not
    everything handled the aborting correctly.

    SysRq : Emergency Remount R/O
    EXT4-fs (sda6): re-mounted. Opts: (null)
    Emergency Remount complete
    CacheFiles: I/O Error: Failed to update xattr with error -30
    FS-Cache: Cache cachefiles stopped due to I/O error
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:128!
    invalid opcode: 0000 [#1] SMP
    CPU 0
    Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

    Pid: 6612, comm: kworker/u:2 Not tainted 3.1.0-rc8-fsdevel+ #1093 /DG965RY
    RIP: 0010:[] [] fscache_submit_exclusive_op+0x2ad/0x2c2 [fscache]
    RSP: 0018:ffff880000853d40 EFLAGS: 00010206
    RAX: ffff880038ac72a8 RBX: ffff8800181f2260 RCX: ffffffff81f2b2b0
    RDX: 0000000000000001 RSI: ffffffff8179a478 RDI: ffff8800181f2280
    RBP: ffff880000853d60 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000001 R12: ffff880038ac7268
    R13: ffff8800181f2280 R14: ffff88003a359190 R15: 000000010122b162
    FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000034cc4a77f0 CR3: 0000000010e96000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/u:2 (pid: 6612, threadinfo ffff880000852000, task ffff880014c3c040)
    Stack:
    ffff8800181f2260 ffff8800181f2310 ffff880038ac7268 ffff8800181f2260
    ffff880000853dc0 ffffffffa0072375 ffff880037ecfe00 ffff88003a359198
    ffff880000853dc0 0000000000000246 0000000000000000 ffff88000a91d308
    Call Trace:
    [] fscache_object_work_func+0x792/0xe65 [fscache]
    [] process_one_work+0x1eb/0x37f
    [] ? process_one_work+0x18d/0x37f
    [] ? fscache_enqueue_dependents+0xd8/0xd8 [fscache]
    [] worker_thread+0x15a/0x21a
    [] ? rescuer_thread+0x188/0x188
    [] kthread+0x7f/0x87
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x45/0xc0
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x53/0x53
    [] ? gs_change+0xb/0xb

    Signed-off-by: David Howells

    David Howells
     
  • Initialise the object event mask with the calculated mask rather than unmasking
    undefined events also.

    Signed-off-by: David Howells

    David Howells
     
  • Provide a proper invalidation method rather than relying on the netfs retiring
    the cookie it has and getting a new one. The problem with this is that it isn't
    easy for the netfs to make sure that it has completed/cancelled all its
    outstanding storage and retrieval operations on the cookie it is retiring.

    Instead, have the cache provide an invalidation method that will cancel or wait
    for all currently outstanding operations before invalidating the cache, and
    will cause new operations to queue up behind that. Whilst invalidation is in
    progress, some requests will be rejected until the cache can stack a barrier on
    the operation queue to cause new operations to be deferred behind it.
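
    The netfs-facing calls that go with this look roughly like the following
    (prototypes approximate that kernel era): invalidation is kicked off, and
    the netfs may then wait for it to drain before reusing the cookie.

    static void example_netfs_invalidate(struct fscache_cookie *cookie)
    {
            fscache_invalidate(cookie);
            fscache_wait_on_invalidate(cookie);
    }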

    Signed-off-by: David Howells

    David Howells
     
  • Fix the state management of internal fscache operations and the accounting of
    what operations are in what states.

    This is done by:

    (1) Give struct fscache_operation a enum variable that directly represents the
    state it's currently in, rather than spreading this knowledge over a bunch
    of flags, who's processing the operation at the moment and whether it is
    queued or not.

    This makes it easier to write assertions to check the state at various
    points and to prevent invalid state transitions.

    (2) Add an 'operation complete' state and supply a function to indicate the
    completion of an operation (fscache_op_complete()) and make things call
    it. The final call to fscache_put_operation() can then check that an op is
    in the appropriate state (complete or cancelled).

    (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
    govern the state of an object:

    (a) The ->n_ops is now the number of extant operations on the object
    and is now decremented by fscache_put_operation() only.

    (b) The ->n_in_progress is simply the number of operations that have been
    taken off of the object's pending queue for the purposes of being
    run. This is decremented by fscache_op_complete() only.

    (c) The ->n_exclusive is the number of exclusive ops that have been
    submitted and queued or are in progress. It is decremented by
    fscache_op_complete() and by fscache_cancel_op().

    fscache_put_operation() and fscache_operation_gc() now no longer try to
    clean up ->n_exclusive and ->n_in_progress. That was leading to double
    decrements against fscache_cancel_op().

    fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
    double decrements against fscache_put_operation().

    fscache_submit_exclusive_op() now decides whether it has to queue an op
    based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
    will persist in being true even after all preceding operations have been
    cancelled or completed. Furthermore, if an object is active and there are
    runnable ops against it, there must be at least one op running.

    (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
    provide a function to record completion of the pages as they complete.

    When n_pages reaches 0, the operation is deemed to be complete and
    fscache_op_complete() is called.

    Add calls to fscache_retrieval_complete() anywhere we've finished with a
    page we've been given to read or allocate for. This includes places where
    we just return pages to the netfs for reading from the server and where
    accessing the cache fails and we discard the proposed netfs page.
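
    In sketch form, item (1) above replaces the scattered flags with something
    like the following explicit state enumeration (the list is approximate; the
    real enum lives with the fscache operation definitions):

    enum example_operation_state {
            EXAMPLE_OP_ST_BLANK,            /* just allocated */
            EXAMPLE_OP_ST_INITIALISED,      /* initialised, not yet submitted */
            EXAMPLE_OP_ST_PENDING,          /* queued on the object, awaiting its turn */
            EXAMPLE_OP_ST_IN_PROGRESS,      /* taken off the pending queue and running */
            EXAMPLE_OP_ST_COMPLETE,         /* fscache_op_complete() has been called */
            EXAMPLE_OP_ST_CANCELLED,        /* fscache_cancel_op() has been called */
            EXAMPLE_OP_ST_DEAD              /* released; must no longer be touched */
    };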

    The bugs in the unfixed state management manifest themselves as oopses like the
    following where the operation completion gets out of sync with return of the
    cookie by the netfs. This is possible because the cache unlocks and returns
    all the netfs pages before recording its completion - which means that there's
    nothing to stop the netfs discarding them and returning the cookie.

    FS-Cache: Cookie 'NFS.fh' still has outstanding reads
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/cookie.c:519!
    invalid opcode: 0000 [#1] SMP
    CPU 1
    Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

    Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
    RIP: 0010:[] [] __fscache_relinquish_cookie+0x170/0x343 [fscache]
    RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
    RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
    RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
    RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
    R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
    R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
    FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
    Stack:
    ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
    ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
    ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
    Call Trace:
    [] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
    [] nfs_clear_inode+0x3c/0x41 [nfs]
    [] nfs4_evict_inode+0x2f/0x33 [nfs]
    [] evict+0xa1/0x15c
    [] dispose_list+0x2c/0x38
    [] prune_icache_sb+0x28c/0x29b
    [] prune_super+0xd5/0x140
    [] shrink_slab+0x102/0x1ab
    [] balance_pgdat+0x2f2/0x595
    [] ? process_timeout+0xb/0xb
    [] kswapd+0x270/0x289
    [] ? __init_waitqueue_head+0x46/0x46
    [] ? balance_pgdat+0x595/0x595
    [] kthread+0x7f/0x87
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x45/0xc0
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x53/0x53
    [] ? gs_change+0xb/0xb

    Signed-off-by: David Howells

    David Howells
     

23 Jul, 2010

1 commit

  • Make fscache object state transition callbacks use workqueue instead
    of slow-work. New dedicated unbound CPU workqueue fscache_object_wq
    is created. get/put callbacks are renamed and modified to take
    @object and called directly from the enqueue wrapper and the work
    function. While at it, make all open coded instances of get/put to
    use fscache_get/put_object().

    * Unbound workqueue is used.

    * work_busy() output is printed instead of slow-work flags in object
    debugging outputs. They mean basically the same thing bit-for-bit.

    * sysctl fscache.object_max_active added to control concurrency. The
    default value is nr_cpus clamped between 4 and
    WQ_UNBOUND_MAX_ACTIVE.

    * slow_work_sleep_till_thread_needed() is replaced with fscache
    private implementation fscache_object_sleep_till_congested() which
    waits on fscache_object_wq congestion.

    * debugfs support is dropped for now. Tracing API based debug
    facility is planned to be added.
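
    A sketch of the setup this describes (flags and clamping are approximate):

    static struct workqueue_struct *fscache_object_wq;

    static int __init example_fscache_wq_init(void)
    {
            unsigned int max_active =
                    clamp_val(num_possible_cpus(), 4, WQ_UNBOUND_MAX_ACTIVE);

            fscache_object_wq = alloc_workqueue("fscache_object", WQ_UNBOUND,
                                                max_active);
            return fscache_object_wq ? 0 : -ENOMEM;
    }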

    Signed-off-by: Tejun Heo
    Acked-by: David Howells

    Tejun Heo
     

30 Mar, 2010

1 commit


20 Nov, 2009

9 commits

  • Catch an overly long wait for an old, dying active object when we want to
    replace it with a new one. The probability is that all the slow-work threads
    are hogged, and the delete can't get a look in.

    What we do instead is:

    (1) if there's nothing in the slow work queue, we sleep until either the dying
    object has finished dying or there is something in the slow work queue
    behind which we can queue our object.

    (2) if there is something in the slow work queue, we return ETIMEDOUT to
    fscache_lookup_object(), which then puts us back on the slow work queue,
    presumably behind the deletion that we're blocked by. We are then
    deferred for a while until we work our way back through the queue -
    without blocking a slow-work thread unnecessarily.

    A backtrace similar to the following may appear in the log without this patch:

    INFO: task kslowd004:5711 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kslowd004 D 0000000000000000 0 5711 2 0x00000080
    ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
    ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
    000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
    Call Trace:
    [] ? trace_hardirqs_on+0xd/0xf
    [] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
    [] cachefiles_wait_bit+0x9/0xd [cachefiles]
    [] __wait_on_bit+0x43/0x76
    [] ? ext3_xattr_get+0x1ec/0x270
    [] out_of_line_wait_on_bit+0x69/0x74
    [] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
    [] ? wake_bit_function+0x0/0x2e
    [] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
    [] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
    [] cachefiles_lookup_object+0xac/0x12a [cachefiles]
    [] fscache_lookup_object+0x1c7/0x214 [fscache]
    [] fscache_object_state_machine+0xa5/0x52d [fscache]
    [] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
    [] slow_work_execute+0x18f/0x2d1
    [] slow_work_thread+0x1c5/0x308
    [] ? autoremove_wake_function+0x0/0x34
    [] ? slow_work_thread+0x0/0x308
    [] kthread+0x7a/0x82
    [] child_rip+0xa/0x20
    [] ? restore_args+0x0/0x30
    [] ? kthread+0x0/0x82
    [] ? child_rip+0x0/0x20
    1 lock held by kslowd004/5711:
    #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]

    Signed-off-by: David Howells

    David Howells
     
  • FS-Cache objects have an FSCACHE_OBJECT_EV_REQUEUE event that can theoretically
    be raised to ask the state machine to requeue the object for further processing
    before the work function returns to the slow-work facility.

    However, fscache_object_work_execute() was clearing that bit before checking
    the event mask to see whether the object has any pending events that require it
    to be requeued immediately.

    Instead, the bit should be cleared after the check and enqueue.

    Signed-off-by: David Howells

    David Howells
     
  • Start processing an object's operations when that object moves into the DYING
    state as the object cannot be destroyed until all its outstanding operations
    have completed.

    Furthermore, make sure that read and allocation operations handle being woken
    up on a dead object. Such events are recorded in the Allocs.abt and
    Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.

    The code for waiting for object activation for the read and allocation
    operations is also extracted into its own function as it is much the same in
    all cases, differing only in the stats incremented.

    Signed-off-by: David Howells

    David Howells
     
  • We must make sure that FSCACHE_COOKIE_LOOKING_UP is cleared on lookup failure
    (if an object reaches the LC_DYING state), and we should clear it before
    clearing FSCACHE_COOKIE_CREATING.

    If this doesn't happen then fscache_wait_for_deferred_lookup() may hold
    allocation and retrieval operations indefinitely until they're interrupted by
    signals - which in turn pins the dying object until they go away.
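
    The ordering requirement on the failure path can be sketched as follows
    (illustrative, not the verbatim patch): LOOKING_UP is dropped, and its
    waiters woken, before CREATING, so deferred reads and allocations do not
    stay parked on a dying object.

    static void example_lookup_failed(struct fscache_cookie *cookie)
    {
            clear_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags);
            wake_up_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP);

            clear_bit(FSCACHE_COOKIE_CREATING, &cookie->flags);
            wake_up_bit(&cookie->flags, FSCACHE_COOKIE_CREATING);
    }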

    Signed-off-by: David Howells

    David Howells
     
  • The object-available state in the object processing state machine (as
    processed by fscache_object_available()) can't rely on the cookie to be
    available because the FSCACHE_COOKIE_CREATING bit may have been cleared by
    fscache_obtained_object() prior to the object being put into the
    FSCACHE_OBJECT_AVAILABLE state.

    Clearing the FSCACHE_COOKIE_CREATING bit on a cookie permits
    __fscache_relinquish_cookie() to proceed and detach the cookie from the
    object.

    To deal with this, we don't dereference object->cookie in
    fscache_object_available() if the object has already been detached.

    In addition, a couple of assertions are added into fscache_drop_object() to
    make sure the object is unbound from the cookie before it gets there.

    Signed-off-by: David Howells

    David Howells
     
  • Count entries to and exits from cache operation table functions. Maintain
    these as a single counter that's added to or removed from as appropriate.

    Signed-off-by: David Howells

    David Howells
     
  • Allow the current state of all fscache objects to be dumped by doing:

    cat /proc/fs/fscache/objects

    By default, all objects and all fields will be shown. This can be restricted
    by adding a suitable key to one of the caller's keyrings (such as the session
    keyring):

    keyctl add user fscache:objlist "<restrictions>" @s

    The <restrictions> are:

    K Show hexdump of object key (don't show if not given)
    A Show hexdump of object aux data (don't show if not given)

    And paired restrictions:

    C Show objects that have a cookie
    c Show objects that don't have a cookie
    B Show objects that are busy
    b Show objects that aren't busy
    W Show objects that have pending writes
    w Show objects that don't have pending writes
    R Show objects that have outstanding reads
    r Show objects that don't have outstanding reads
    S Show objects that have slow work queued
    s Show objects that don't have slow work queued

    If neither side of a restriction pair is given, then both are implied. For
    example:

    keyctl add user fscache:objlist KB @s

    shows objects that are busy, and lists their object keys, but does not dump
    their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
    not implied.

    Signed-off-by: David Howells

    David Howells
     
  • Annotate slow-work runqueue proc lines for FS-Cache work items. Objects
    include the object ID and the state. Operations include the object ID, the
    operation ID and the operation type and state.

    Signed-off-by: David Howells

    David Howells
     
  • Wait for outstanding slow work items belonging to a module to clear when
    unregistering that module as a user of the facility. This prevents the put_ref
    code of a work item from being taken away before it returns.

    Signed-off-by: David Howells

    David Howells
     

03 Apr, 2009

1 commit

  • Implement the cache object management state machine.

    The following documentation is added to illuminate the working of this state
    machine. It will also be added as:

    Documentation/filesystems/caching/object.txt

    ====================================================
    IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
    ====================================================

    ==============
    REPRESENTATION
    ==============

    FS-Cache maintains an in-kernel representation of each object that a netfs is
    currently interested in. Such objects are represented by the fscache_cookie
    struct and are referred to as cookies.

    FS-Cache also maintains a separate in-kernel representation of the objects that
    a cache backend is currently actively caching. Such objects are represented by
    the fscache_object struct. The cache backends allocate these upon request, and
    are expected to embed them in their own representations. These are referred to
    as objects.
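
    As a rough illustration of that embedding (the backend structure and its
    extra fields here are hypothetical, not taken from any particular cache
    backend):

        /* Hedged sketch: a backend wraps the generic FS-Cache object in its
         * own per-object structure. */
        struct mycache_object {
                struct fscache_object   fscache;        /* generic object state */
                struct dentry           *backer;        /* backing file on disk */
                unsigned long           flags;          /* backend-private flags */
        };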

    There is a 1:N relationship between cookies and objects. A cookie may be
    represented by multiple objects - an index may exist in more than one cache -
    or even by no objects (it may not be cached).

    Furthermore, both cookies and objects are hierarchical. The two hierarchies
    correspond, but the cookies tree is a superset of the union of the object trees
    of multiple caches:

        NETFS INDEX TREE               :      CACHE 1     :      CACHE 2
                                       :                  :
                                       :   +-----------+  :
                              +----------->|  IObject  |  :
          +-----------+       |        :   +-----------+  :
          |  ICookie  |-------+        :         |        :
          +-----------+       |        :         |        :   +-----------+
                |             +------------------------------>|  IObject  |
                |                      :         |        :   +-----------+
                |                      :         V        :         |
                |                      :   +-----------+  :         |
                V             +----------->|  IObject  |  :         |
          +-----------+       |        :   +-----------+  :         |
          |  ICookie  |-------+        :         |        :         V
          +-----------+       |        :         |        :   +-----------+
                |             +------------------------------>|  IObject  |
          +-----+-----+                :         |        :   +-----------+
          |           |                :         |        :         |
          V           |                :         V        :         |
    +-----------+     |                :   +-----------+  :         |
    |  ICookie  |------------------------->|  IObject  |  :         |
    +-----------+     |                :   +-----------+  :         |
          |           V                :         |        :         V
          |     +-----------+          :         |        :   +-----------+
          |     |  ICookie  |-------------------------------->|  IObject  |
          |     +-----------+          :         |        :   +-----------+
          V           |                :         V        :         |
    +-----------+     |                :   +-----------+  :         |
    |  DCookie  |------------------------->|  DObject  |  :         |
    +-----------+     |                :   +-----------+  :         |
                      |                :                  :         |
              +-------+-------+        :                  :         |
              |               |        :                  :         |
              V               V        :                  :         V
        +-----------+   +-----------+  :                  :   +-----------+
        |  DCookie  |   |  DCookie  |------------------------>|  DObject  |
        +-----------+   +-----------+  :                  :   +-----------+
                                       :                  :

    In the above illustration, ICookie and IObject represent indices and DCookie
    and DObject represent data storage objects. Indices may have representation in
    multiple caches, but currently, non-index objects may not. Objects of any type
    may also be entirely unrepresented.

    As far as the netfs API goes, the netfs is only actually permitted to see
    pointers to the cookies. The cookies themselves and any objects attached to
    those cookies are hidden from it.

    ===============================
    OBJECT MANAGEMENT STATE MACHINE
    ===============================

    Within FS-Cache, each active object is managed by its own individual state
    machine. The state for an object is kept in the fscache_object struct, in
    object->state. A cookie may point to a set of objects that are in different
    states.

    Each state has an action associated with it that is invoked when the machine
    wakes up in that state. There are four logical sets of states:

    (1) Preparation: states that wait for the parent objects to become ready. The
    representations are hierarchical, and it is expected that an object must
    be created or accessed with respect to its parent object.

    (2) Initialisation: states that perform lookups in the cache and validate
    what's found and that create on disk any missing metadata.

    (3) Normal running: states that allow netfs operations on objects to proceed
    and that update the state of objects.

    (4) Termination: states that detach objects from their netfs cookies, that
    delete objects from disk, that handle disk and system errors and that free
    up in-memory resources.

    In most cases, transitioning between states is in response to signalled events.
    When a state has finished processing, it will usually set the mask of events in
    which it is interested (object->event_mask) and relinquish the worker thread.
    Then when an event is raised (by calling fscache_raise_event()), if the event
    is not masked, the object will be queued for processing (by calling
    fscache_enqueue_object()).
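
    A hedged sketch of that raise/enqueue logic (object->event_mask is named
    above; the object->events field is an assumption for illustration):

        static void fscache_raise_event(struct fscache_object *object,
                                        unsigned event)
        {
                /* Record the event; only queue the object for processing if
                 * the state machine declared interest in this event. */
                if (!test_and_set_bit(event, &object->events) &&
                    test_bit(event, &object->event_mask))
                        fscache_enqueue_object(object);
        }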

    PROVISION OF CPU TIME
    ---------------------

    The work to be done by the various states is given CPU time by the threads of
    the slow work facility (see Documentation/slow-work.txt). This is used in
    preference to the workqueue facility because:

    (1) Threads may be completely occupied for very long periods of time by a
    particular work item. These state actions may be doing sequences of
    synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
    getxattr, truncate, unlink, rmdir, rename).

    (2) Threads may do little actual work, but may rather spend a lot of time
    sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
    workqueues don't necessarily have the right numbers of threads.

    LOCKING SIMPLIFICATION
    ----------------------

    Because only one worker thread may be operating on any particular object's
    state machine at once, this simplifies the locking, particularly with respect
    to disconnecting the netfs's representation of a cache object (fscache_cookie)
    from the cache backend's representation (fscache_object) - which may be
    requested from either end.

    =================
    THE SET OF STATES
    =================

    The object state machine has a set of states that it can be in; a sketch of
    the corresponding enum is given after the list below. There are preparation
    states in which the object sets itself up and waits for its parent object to
    transit to a state that allows access to its children:

    (1) State FSCACHE_OBJECT_INIT.

    Initialise the object and wait for the parent object to become active. In
    the cache, it is expected that it will not be possible to look an object
    up from the parent object until that parent object itself has been looked
    up.

    There are initialisation states in which the object sets itself up and accesses
    disk for the object metadata:

    (2) State FSCACHE_OBJECT_LOOKING_UP.

    Look up the object on disk, using the parent as a starting point.
    FS-Cache expects the cache backend to probe the cache to see whether this
    object is represented there, and if it is, to see if it's valid (coherency
    management).

    The cache should call fscache_object_lookup_negative() to indicate lookup
    failure for whatever reason, and should call fscache_obtained_object() to
    indicate success.

    At the completion of lookup, FS-Cache will let the netfs go ahead with
    read operations, no matter whether the file is yet cached. If not yet
    cached, read operations will be immediately rejected with ENODATA until
    the first known page is uncached - as up to that point there can be no
    data to be read out of the cache for that file that isn't currently also
    held in the pagecache.

    (3) State FSCACHE_OBJECT_CREATING.

    Create an object on disk, using the parent as a starting point. This
    happens if the lookup failed to find the object, or if the object's
    coherency data indicated that what's on disk is out of date. In this
    state, FS-Cache expects the cache to create the object.

    The cache should call fscache_obtained_object() if creation completes
    successfully, fscache_object_lookup_negative() otherwise.

    At the completion of creation, FS-Cache will start processing write
    operations the netfs has queued for an object. If creation failed, the
    write ops will be transparently discarded, and nothing recorded in the
    cache.

    There are some normal running states in which the object spends its time
    servicing netfs requests:

    (4) State FSCACHE_OBJECT_AVAILABLE.

    A transient state in which pending operations are started, child objects
    are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
    lookup data is freed.

    (5) State FSCACHE_OBJECT_ACTIVE.

    The normal running state. In this state, requests the netfs makes will be
    passed on to the cache.

    (6) State FSCACHE_OBJECT_UPDATING.

    The state machine comes here to update the object in the cache from the
    netfs's records. This involves updating the auxiliary data that is used
    to maintain coherency.

    And there are terminal states in which an object cleans itself up, deallocates
    memory and potentially deletes stuff from disk:

    (7) State FSCACHE_OBJECT_LC_DYING.

    The object comes here if it is dying because of a lookup or creation
    error. This would be due to a disk error or system error of some sort.
    Temporary data is cleaned up, and the parent is released.

    (8) State FSCACHE_OBJECT_DYING.

    The object comes here if it is dying due to an error, because its parent
    cookie has been relinquished by the netfs or because the cache is being
    withdrawn.

    Any child objects waiting on this one are given CPU time so that they too
    can destroy themselves. This object waits for all its children to go away
    before advancing to the next state.

    (9) State FSCACHE_OBJECT_ABORT_INIT.

    The object comes to this state if it was waiting on its parent in
    FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
    so that the parent may proceed from the FSCACHE_OBJECT_DYING state.

    (10) State FSCACHE_OBJECT_RELEASING.
    (11) State FSCACHE_OBJECT_RECYCLING.

    The object comes to one of these two states when dying once it is rid of
    all its children, if it is dying because the netfs relinquished its
    cookie. In the first state, the cached data is expected to persist, and
    in the second it will be deleted.

    (12) State FSCACHE_OBJECT_WITHDRAWING.

    The object transits to this state if the cache decides it wants to
    withdraw the object from service, perhaps to make space, but also due to
    error or just because the whole cache is being withdrawn.

    (13) State FSCACHE_OBJECT_DEAD.

    The object transits to this state when the in-memory object record is
    ready to be deleted. The object processor shouldn't ever see an object in
    this state.
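
    As a rough sketch (not quoted from the header file), the thirteen states
    above might be declared as an enum along these lines:

        /* Hedged sketch of the state enumeration; the ordering and the
         * trailing count value are assumptions for illustration. */
        enum fscache_object_state {
                FSCACHE_OBJECT_INIT,            /* (1) waiting for parent */
                FSCACHE_OBJECT_LOOKING_UP,      /* (2) looking up on disk */
                FSCACHE_OBJECT_CREATING,        /* (3) creating on disk */
                FSCACHE_OBJECT_AVAILABLE,       /* (4) transient post-lookup/creation */
                FSCACHE_OBJECT_ACTIVE,          /* (5) normal running state */
                FSCACHE_OBJECT_UPDATING,        /* (6) updating coherency data */
                FSCACHE_OBJECT_LC_DYING,        /* (7) lookup/creation failed */
                FSCACHE_OBJECT_DYING,           /* (8) dying, waiting for children */
                FSCACHE_OBJECT_ABORT_INIT,      /* (9) parent died while waiting */
                FSCACHE_OBJECT_RELEASING,       /* (10) being released, data kept */
                FSCACHE_OBJECT_RECYCLING,       /* (11) being retired, data deleted */
                FSCACHE_OBJECT_WITHDRAWING,     /* (12) being withdrawn by the cache */
                FSCACHE_OBJECT_DEAD,            /* (13) in-memory record can go */
                FSCACHE_OBJECT__NSTATES
        };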

    THE SET OF EVENTS
    -----------------

    There are a number of events that can be raised to an object state machine:

    (*) FSCACHE_OBJECT_EV_UPDATE

    The netfs requested that an object be updated. The state machine will ask
    the cache backend to update the object, and the cache backend will ask the
    netfs for details of the change through its cookie definition ops.

    (*) FSCACHE_OBJECT_EV_CLEARED

    This is signalled in two circumstances:

    (a) when an object's last child object is dropped and

    (b) when the last operation outstanding on an object is completed.

    This is used to proceed from the dying state.

    (*) FSCACHE_OBJECT_EV_ERROR

    This is signalled when an I/O error occurs during the processing of some
    object.

    (*) FSCACHE_OBJECT_EV_RELEASE
    (*) FSCACHE_OBJECT_EV_RETIRE

    These are signalled when the netfs relinquishes a cookie it was using.
    The event selected depends on whether the netfs asks for the backing
    object to be retired (deleted) or retained.

    (*) FSCACHE_OBJECT_EV_WITHDRAW

    This is signalled when the cache backend wants to withdraw an object.
    This means that the object will have to be detached from the netfs's
    cookie.

    Because the withdrawal, release and retirement events are all handled by the
    object state machine, it doesn't matter if there's a collision, with both
    ends trying to sever the connection at the same time. The state machine can
    just pick which one it wants to honour, and honouring that one implicitly
    deals with the other.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells