30 May, 2018

1 commit

  • [ Upstream commit 2c98425720233ae3e135add0c7e869b32913502f ]

    If the fscache asynchronous write operation elects to discard a page that's
    pending storage to the cache because the page would be over the store limit,
    then it needs to wake the page, as someone may be waiting on completion of
    the write.

    The problem is that the store limit may be updated by a different
    asynchronous operation - and so may miss the write - and that the store
    limit may not even get updated until later by the netfs.

    Fix the kernel hang by making fscache_write_op() mark as written any pages
    that are over the limit.
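    A minimal sketch of the resulting flow (simplified; locking, statistics
    and tracing are omitted, and the shape is inferred from the message rather
    than the diff):

    /* in fscache_write_op(), after a pending page has been looked up: */
    if (page->index >= fscache_op->store_limit) {
            /* Cannot store the page, but someone may be sleeping in
             * fscache_wait_on_page_write(), so mark the page written
             * to wake them rather than skipping it silently. */
            fscache_end_page_write(object, page);
            goto again;
    }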

    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

07 Sep, 2017

2 commits

    All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as the desired number of pages.

    Just drop the argument.
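    The resulting prototype looks roughly like this (a sketch; the pointer
    form of 'start' comes from the sibling patch described in the next entry):

    /* before: every caller passed PAGEVEC_SIZE here */
    unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
                            pgoff_t *start, unsigned nr_pages);

    /* after: PAGEVEC_SIZE is implied */
    unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
                            pgoff_t *start);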

    Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    Make pagevec_lookup() (and the underlying find_get_pages()) update the
    index to the next page at which iteration should continue. Most callers
    want this, and pagevec_lookup_tag() already behaves this way.
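    A sketch of the caller pattern this enables (hypothetical caller; the
    nr_pages argument is already gone per the entry above):

    pgoff_t index = 0;
    struct pagevec pvec;

    pagevec_init(&pvec, 0);
    while (pagevec_lookup(&pvec, mapping, &index)) {
            /* process pvec.pages[0 .. pagevec_count(&pvec) - 1];
             * 'index' now points past the last page found */
            pagevec_release(&pvec);
    }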

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

05 Apr, 2016

1 commit

    The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
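    For illustration, a typical conversion looks like this (hypothetical
    snippet, not taken from the patch itself):

    /* before */
    offset = pos & (PAGE_CACHE_SIZE - 1);
    page_cache_get(page);
    page_cache_release(page);

    /* after */
    offset = pos & (PAGE_SIZE - 1);
    get_page(page);
    put_page(page);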

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is reverting the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Nov, 2015

1 commit

  • Handle a write being requested to the page immediately beyond the EOF
    marker on a cache object. Currently this gets an assertion failure in
    CacheFiles because the EOF marker is used there to encode information about
    a partial page at the EOF - which could lead to an unknown blank spot in
    the file if we extend the file over it.

    The problem is actually in fscache where we check the index of the page
    being written against store_limit. store_limit is set to the number of
    pages that we're allowed to store by fscache_set_store_limit() - which
    means it's one more than the index of the last page we're allowed to store.
    The problem is that we permit writing to a page with an index _equal_ to
    the store limit - when we should reject that case.

    Whilst we're at it, change the triggered assertion in CacheFiles to just
    return -ENOBUFS instead.
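    The crux of the fix is the comparison operator (a sketch; the surrounding
    code is omitted and the label name is illustrative):

    /* store_limit is one past the last storable page index, so an index
     * equal to it must be rejected as well */
    if (page->index >= object->store_limit)         /* was: > */
            goto superseded;                        /* -ENOBUFS, no assert */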

    The assertion failure looks something like this:

    CacheFiles: Assertion failed
    1000 < 7b1 is false
    ------------[ cut here ]------------
    kernel BUG at fs/cachefiles/rdwr.c:962!
    ...
    RIP: 0010:[] [] cachefiles_write_page+0x273/0x2d0 [cachefiles]

    Cc: stable@vger.kernel.org # v2.6.31+; earlier - that + backport of a17754f (at least)
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Nov, 2015

1 commit

  • mm, page_alloc: distinguish between being unable to sleep, unwilling to
    sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of call sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking whether they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch after
    this list). This is because checking for __GFP_WAIT as was done
    historically can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM
    is used instead of the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
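    A sketch of the helper and the caller pattern mentioned above (the
    helper's shape matches my understanding of include/linux/gfp.h; the two
    alloc functions are hypothetical):

    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
            return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }

    /* replacing the historical 'if (gfp_mask & __GFP_WAIT)' test: */
    if (gfpflags_allow_blocking(gfp_mask))
            do_blocking_alloc();    /* may sleep and enter direct reclaim */
    else
            do_atomic_alloc();      /* must not block */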

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

02 Apr, 2015

5 commits

  • Now that the retrieval operation may be disposed of by fscache_put_operation()
    before we actually set the context, the retrieval-specific cleanup operation
    can produce a NULL-pointer dereference when it tries to unconditionally clean
    up the netfs context.

    Given that it is expected that we'll get at least as far as the place where we
    currently set the context pointer and it is unlikely we'll go through the
    error handling paths prior to that point, retain the context right from the
    point that the retrieval op is allocated.

    Concomitant to this, we need to retain the cookie pointer in the retrieval op
    also so that we can call the netfs to release its context in the release
    method.

    In addition, we might now get into fscache_release_retrieval_op() with the op
    only initialised. To this end, set the operation to DEAD only after the
    release method has been called and skip the n_pages test upon cleanup if the
    op is still in the INITIALISED state.

    Without these changes, the following oops might be seen:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
    ...
    RIP: 0010:[] fscache_release_retrieval_op+0xae/0x100
    ...
    Call Trace:
    [] fscache_put_operation+0x117/0x2e0
    [] __fscache_read_or_alloc_pages+0x351/0x3ac
    [] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
    [] nfs_readpages+0x10c/0x185 [nfs]
    [] ? alloc_pages_current+0x119/0x13e
    [] ? __page_cache_alloc+0xfb/0x10a
    [] __do_page_cache_readahead+0x188/0x22c
    [] ondemand_readahead+0x29e/0x2af
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_read_iter+0x1a2/0x55a
    [] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
    [] nfs_file_read+0x49/0x70 [nfs]
    [] new_sync_read+0x78/0x9c
    [] __vfs_read+0x13/0x38
    [] vfs_read+0x95/0x121
    [] SyS_read+0x4c/0x8a
    [] system_call_fastpath+0x12/0x17

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     
  • Any time an incomplete operation is cancelled, the operation cancellation
    function needs to be called to clean up. This is currently being passed
    directly to some of the functions that might want to call it, but not all.

    Instead, pass the cancellation method pointer to the fscache_operation_init()
    and have that cache it in the operation struct. Further, plug in a dummy
    cancellation handler if the caller declines to set one as this allows us to
    call the function unconditionally (the extra overhead isn't worth bothering
    about as we don't expect to be calling this typically).

    The cancellation method must thence be called everywhere the CANCELLED state
    is set. Note that we call it *before* setting the CANCELLED state such that
    the method can use the old state value to guide its operation.

    fscache_do_cancel_retrieval() needs moving higher up in the sources so that
    the init function can use it now.
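    After this change, initialisation plausibly looks like this (a sketch;
    the exact parameter order is an assumption):

    fscache_operation_init(&op->op, fscache_retrieval_work,
                           fscache_do_cancel_retrieval,     /* cancel */
                           fscache_release_retrieval_op);   /* release */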

    Without this, the following oops may be seen:

    FS-Cache: Assertion failed
    FS-Cache: 3 == 0 is false
    ------------[ cut here ]------------
    kernel BUG at ../fs/fscache/page.c:261!
    ...
    RIP: 0010:[] fscache_release_retrieval_op+0x77/0x100
    [] fscache_put_operation+0x114/0x2da
    [] __fscache_read_or_alloc_pages+0x358/0x3b3
    [] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
    [] nfs_readpages+0x10c/0x185 [nfs]
    [] ? alloc_pages_current+0x119/0x13e
    [] ? __page_cache_alloc+0xfb/0x10a
    [] __do_page_cache_readahead+0x188/0x22c
    [] ondemand_readahead+0x29e/0x2af
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_read_iter+0x1a2/0x55a
    [] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
    [] nfs_file_read+0x49/0x70 [nfs]
    [] new_sync_read+0x78/0x9c
    [] __vfs_read+0x13/0x38
    [] vfs_read+0x95/0x121
    [] SyS_read+0x4c/0x8a
    [] system_call_fastpath+0x12/0x17

    The assertion is showing that the remaining number of pages (n_pages) is not 0
    when the operation is being released.

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     
  • Call fscache_put_operation() or a wrapper on any op that has gone through
    fscache_operation_init() so that the accounting shown in /proc is done
    correctly, specifically fscache_n_op_release.

    fscache_put_operation() therefore now allows an op in the INITIALISED state as
    well as in the CANCELLED and COMPLETE states.

    Note that this means that an operation can get put that doesn't have its
    ->object pointer filled in, so anything that depends on the object needs to be
    conditional in fscache_put_operation().

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     
  • Currently, fscache_cancel_op() only cancels pending operations - attempts to
    cancel in-progress operations are ignored. This leads to a problem in
    fscache_wait_for_operation_activation() whereby the wait is terminated, but
    the object has been killed.

    The check at the end of the function now triggers because it's no longer
    contingent on the cache having produced an I/O error since the commit that
    fixed the logic error in fscache_object_is_dead().

    The result of the check is that it tries to cancel the operation - but
    since the operation may no longer be pending by this point, the
    cancellation request may be ignored - with the result that the object is
    just put by the caller and fscache_put_operation() has an assertion
    failure because the operation isn't in either the COMPLETE or the
    CANCELLED states.

    To fix this, we permit in-progress ops to be cancelled under some
    circumstances.

    The bug results in an oops that looks something like this:

    FS-Cache: fscache_wait_for_operation_activation() = -ENOBUFS [obj dead 3]
    FS-Cache:
    FS-Cache: Assertion failed
    FS-Cache: 3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at ../fs/fscache/operation.c:432!
    ...
    RIP: 0010:[] fscache_put_operation+0xf2/0x2cd
    Call Trace:
    [] __fscache_read_or_alloc_pages+0x2ec/0x3b3
    [] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
    [] nfs_readpages+0x10c/0x185 [nfs]
    [] ? alloc_pages_current+0x119/0x13e
    [] ? __page_cache_alloc+0xfb/0x10a
    [] __do_page_cache_readahead+0x188/0x22c
    [] ondemand_readahead+0x29e/0x2af
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_read_iter+0x1a2/0x55a
    [] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
    [] nfs_file_read+0x49/0x70 [nfs]
    [] new_sync_read+0x78/0x9c
    [] __vfs_read+0x13/0x38
    [] vfs_read+0x95/0x121
    [] SyS_read+0x4c/0x8a
    [] system_call_fastpath+0x12/0x17

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     
  • fscache_object_is_dead() returns true only if the object is marked dead and
    the cache got an I/O error. This should be a logical OR instead. Since two
    of the callers got split up into handling for separate subcases, expand the
    other callers and kill the function. This is probably the right thing to do
    anyway since one of the subcases isn't about the object at all, but rather
    about the cache.

    Signed-off-by: David Howells
    Reviewed-by: Steve Dickson
    Acked-by: Jeff Layton

    David Howells
     

18 Sep, 2014

1 commit

    In rare cases under heavy VMA pressure the ref count for an fscache cookie
    becomes corrupt: we decrement the ref count even if we failed before
    incrementing it.

    FS-Cache: Assertion failed bnode-eca5f9c6/syslog
    0 > 0 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/cookie.c:519!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    [] __fscache_relinquish_cookie+0x50/0x220 [fscache]
    [] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
    [] ceph_destroy_inode+0x33/0x200 [ceph]
    [] ? __fsnotify_inode_delete+0xe/0x10
    [] destroy_inode+0x3c/0x70
    [] evict+0x111/0x180
    [] iput+0x103/0x190
    [] __dentry_kill+0x1c8/0x220
    [] shrink_dentry_list+0xf1/0x250
    [] prune_dcache_sb+0x4c/0x60
    [] super_cache_scan+0xff/0x170
    [] shrink_slab_node+0x140/0x2c0
    [] shrink_slab+0x8a/0x130
    [] balance_pgdat+0x3e2/0x5d0
    [] kswapd+0x16a/0x4a0
    [] ? __wake_up_sync+0x20/0x20
    [] ? balance_pgdat+0x5d0/0x5d0
    [] kthread+0xc9/0xe0
    [] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
    [] ? flush_kthread_worker+0xb0/0xb0
    [] ret_from_fork+0x7c/0xb0
    [] ? flush_kthread_worker+0xb0/0xb0
    RIP [] __fscache_disable_cookie+0x1db/0x210 [fscache]
    RSP
    ---[ end trace 254d0d7c74a01f25 ]---

    Signed-off-by: Milosz Tanski
    Signed-off-by: David Howells

    Milosz Tanski
     

27 Aug, 2014

1 commit

    This is meant to avoid a recursive hang caused by the underlying
    filesystem trying to grab a free page and causing a write-out.

    INFO: task kworker/u30:7:28375 blocked for more than 120 seconds.
    Not tainted 3.15.0-virtual #74
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u30:7 D 0000000000000000 0 28375 2 0x00000000
    Workqueue: fscache_operation fscache_op_work_func [fscache]
    ffff88000b147148 0000000000000046 0000000000000000 ffff88000b1471c8
    ffff8807aa031820 0000000000014040 ffff88000b147fd8 0000000000014040
    ffff880f0c50c860 ffff8807aa031820 ffff88000b147158 ffff88007be59cd0
    Call Trace:
    [] schedule+0x29/0x70
    [] __fscache_wait_on_page_write+0x55/0x90 [fscache]
    [] ? __wake_up_sync+0x20/0x20
    [] __fscache_maybe_release_page+0x65/0x1e0 [fscache]
    [] ceph_releasepage+0x83/0x100 [ceph]
    [] ? anon_vma_fork+0x130/0x130
    [] try_to_release_page+0x32/0x50
    [] shrink_page_list+0x7e6/0x9d0
    [] ? isolate_lru_pages.isra.73+0x78/0x1e0
    [] shrink_inactive_list+0x252/0x4c0
    [] shrink_lruvec+0x3e1/0x670
    [] shrink_zone+0x3f/0x110
    [] do_try_to_free_pages+0x1d6/0x450
    [] ? zone_statistics+0x99/0xc0
    [] try_to_free_pages+0xc4/0x180
    [] __alloc_pages_nodemask+0x6b2/0xa60
    [] ? __find_get_block+0xbe/0x250
    [] ? wake_up_bit+0x2e/0x40
    [] alloc_pages_current+0xb3/0x180
    [] __page_cache_alloc+0xb7/0xd0
    [] grab_cache_page_write_begin+0x7c/0xe0
    [] ? ext4_mark_inode_dirty+0x82/0x220
    [] ext4_da_write_begin+0x89/0x2d0
    [] generic_perform_write+0xbe/0x1d0
    [] ? update_time+0x81/0xc0
    [] ? mnt_clone_write+0x12/0x30
    [] __generic_file_aio_write+0x1ce/0x3f0
    [] generic_file_aio_write+0x5e/0xe0
    [] ext4_file_write+0x9f/0x410
    [] ? ext4_file_open+0x66/0x180
    [] do_sync_write+0x5a/0x90
    [] cachefiles_write_page+0x149/0x430 [cachefiles]
    [] ? radix_tree_gang_lookup_tag+0x89/0xd0
    [] fscache_write_op+0x222/0x3b0 [fscache]
    [] fscache_op_work_func+0x3a/0x100 [fscache]
    [] process_one_work+0x179/0x4a0
    [] worker_thread+0x11b/0x370
    [] ? manage_workers.isra.21+0x2e0/0x2e0
    [] kthread+0xc9/0xe0
    [] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
    [] ? flush_kthread_worker+0xb0/0xb0
    [] ret_from_fork+0x7c/0xb0
    [] ? flush_kthread_worker+0xb0/0xb0

    Signed-off-by: Milosz Tanski
    Signed-off-by: David Howells

    Milosz Tanski
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.
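    A sketch of a typical conversion (the bit name and the old action
    function are illustrative):

    /* before: the caller supplied the action function */
    err = wait_on_bit(&flags, MY_BIT, my_wait_action, TASK_UNINTERRUPTIBLE);

    /* after: standard schedule()-based wait ... */
    err = wait_on_bit(&flags, MY_BIT, TASK_UNINTERRUPTIBLE);

    /* ... or io_schedule()-based wait for I/O completion bits */
    err = wait_on_bit_io(&flags, MY_BIT, TASK_UNINTERRUPTIBLE);

    /* signals now depend on the mode; the return is merely non-zero */
    if (wait_on_bit(&flags, MY_BIT, TASK_INTERRUPTIBLE))
            return -ERESTARTSYS;    /* caller picks its own error code */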

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack will. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

28 Sep, 2013

2 commits

  • Provide the ability to enable and disable fscache cookies. A disabled cookie
    will reject or ignore further requests to:

    Acquire a child cookie
    Invalidate and update backing objects
    Check the consistency of a backing object
    Allocate storage for backing page
    Read backing pages
    Write to backing pages

    but still allows:

    Checks/waits on the completion of already in-progress objects
    Uncaching of pages
    Relinquishment of cookies

    Two new operations are provided:

    (1) Disable a cookie:

    void fscache_disable_cookie(struct fscache_cookie *cookie,
    bool invalidate);

    If the cookie is not already disabled, this locks the cookie against other
    dis/enablement ops, marks the cookie as being disabled, discards or
    invalidates any backing objects and waits for cessation of activity on any
    associated object.

    This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
    but it reinitialises the cookie such that it can be reenabled.

    All possible failures are handled internally. The caller should consider
    calling fscache_uncache_all_inode_pages() afterwards to make sure all page
    markings are cleared up.

    (2) Enable a cookie:

    void fscache_enable_cookie(struct fscache_cookie *cookie,
    bool (*can_enable)(void *data),
    void *data)

    If the cookie is not already enabled, this locks the cookie against other
    dis/enablement ops, invokes can_enable() and, if the cookie is not an
    index cookie, will begin the procedure of acquiring backing objects.

    The optional can_enable() function is passed the data argument and returns
    a ruling as to whether or not enablement should actually be permitted to
    begin.

    All possible failures are handled internally. The cookie will only be
    marked as enabled if provisional backing objects are allocated.

    A later patch will introduce these to NFS. Cookie enablement during
    nfs_open() is then contingent on i_writecount <= 0.

    Signed-off-by: David Howells
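    A sketch of that contingency from the netfs side (hypothetical code; the
    real NFS patch may differ):

    static bool nfs_fscache_can_enable(void *data)
    {
            struct inode *inode = data;

            /* only enable caching while there are no writers */
            return atomic_read(&inode->i_writecount) <= 0;
    }

    fscache_enable_cookie(cookie, nfs_fscache_can_enable, inode);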

    David Howells
     
  • Add wrapper functions for dealing with cookie->n_active:

    (*) __fscache_use_cookie() to increment it.

    (*) __fscache_unuse_cookie() to decrement and test against zero.

    (*) __fscache_wake_unused_cookie() to wake up anyone waiting for it to reach
    zero.

    The second and third are split so that the third can be done after cookie->lock
    has been released in case the waiter wakes up whilst we're still holding it and
    tries to get it.
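    The intended calling pattern, roughly (the atomic_* notes are assumptions
    about the implementation):

    __fscache_use_cookie(cookie);           /* atomic_inc(&cookie->n_active) */
    ...
    if (__fscache_unuse_cookie(cookie)) {   /* atomic_dec_and_test() */
            /* drop cookie->lock first, then: */
            __fscache_wake_unused_cookie(cookie);
    }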

    We will need to wake-on-zero once the cookie disablement patch is applied
    because it will then be possible to see n_active become zero without the cookie
    being relinquished.

    Also move the cookie 'use' accounting out of fscache_attr_changed_op() and
    into fscache_attr_changed() and the operation struct so that cookie
    disablement will be able to track it.

    Whilst we're at it, only increment n_active if we're about to do
    fscache_submit_op() so that we don't have to deal with undoing it if anything
    earlier fails. Possibly this should be moved into fscache_submit_op() which
    could look at FSCACHE_OP_UNUSE_COOKIE.

    Signed-off-by: David Howells

    David Howells
     

12 Sep, 2013

1 commit

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];
          <interrupt>
          ...
          radix_tree_preload()
          ...
          radix_tree_insert()
            radix_tree_node_alloc()
              if (rtp->nr) {
                ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    The in_interrupt() check is somewhat ugly but we cannot simply key off the
    passed gfp_mask as that is acquired from root_gfp_mask() and is thus the
    same for all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when the passed gfp_mask doesn't allow waiting.
    Again, preallocation in such a case doesn't make sense, and when
    preallocation would happen in interrupt we could possibly leak some
    allocated nodes. However, some users of radix_tree_preload() require the
    following radix_tree_insert() to succeed. To avoid unexpected effects for
    these users, radix_tree_preload() only warns if the passed gfp mask
    doesn't allow waiting, and we provide a new function
    radix_tree_maybe_preload() for those users which get different gfp masks
    from different call sites and which are prepared to handle
    radix_tree_insert() failure.
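    A sketch of a caller that is prepared for insertion failure (illustrative
    wrapper, not taken from the patch):

    static int add_item(struct radix_tree_root *root, unsigned long index,
                        void *item, gfp_t gfp)
    {
            int err;

            /* preloads only if gfp allows waiting; otherwise it just
             * disables preemption and returns 0 */
            err = radix_tree_maybe_preload(gfp);
            if (err)
                    return err;
            err = radix_tree_insert(root, index, item); /* may be -ENOMEM */
            radix_tree_preload_end();
            return err;
    }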

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

06 Sep, 2013

2 commits

    Currently the fscache code expects the netfs to call
    fscache_read_or_alloc_pages() inside the aops readpages callback. It
    marks all the pages in the list provided by readahead with PG_private_2.
    In cases where the netfs fails to read all the pages (which is legal) it
    ends up returning to the readahead code and triggering a BUG. This
    happens because the page list still contains marked pages.

    This patch implements a simple fscache_readpages_cancel function that the netfs
    should call before returning from readpages. It will revoke the pages from the
    underlying cache backend and unmark them.

    The problem was originally worked out in the Ceph devel tree, but it also
    occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages()
    will clean up the unprocessed pages in the case of an error.
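    A sketch of the intended use in a netfs ->readpages() error path (the
    surrounding names are illustrative):

    ret = fscache_read_or_alloc_pages(cookie, mapping, pages, &nr_pages,
                                      my_read_complete, ctx, gfp);
    ...
    if (read_failed) {
            /* revoke and unmark (PG_private_2) the unread pages */
            fscache_readpages_cancel(cookie, pages);
            return ret;
    }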

    This can be used to address the following oops:

    [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e
    [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping:
    (null) index:0x0
    [12410647.597298] page flags: 0x200000000001000(private_2)

    ...

    [12410647.597334] Call Trace:
    [12410647.597345] [] dump_stack+0x19/0x1b
    [12410647.597356] [] bad_page+0xc7/0x120
    [12410647.597359] [] free_pages_prepare+0x10e/0x120
    [12410647.597361] [] free_hot_cold_page+0x40/0x170
    [12410647.597363] [] __put_single_page+0x27/0x30
    [12410647.597365] [] put_page+0x25/0x40
    [12410647.597376] [] ceph_readpages+0x2e9/0x6e0 [ceph]
    [12410647.597379] [] __do_page_cache_readahead+0x1af/0x260
    [12410647.597382] [] ra_submit+0x21/0x30
    [12410647.597384] [] filemap_fault+0x254/0x490
    [12410647.597387] [] __do_fault+0x6f/0x4e0
    [12410647.597391] [] ? __switch_to+0x16d/0x4a0
    [12410647.597395] [] ? finish_task_switch+0x5a/0xc0
    [12410647.597398] [] handle_pte_fault+0xf6/0x930
    [12410647.597401] [] ? pte_mfn_to_pfn+0x93/0x110
    [12410647.597403] [] ? xen_pmd_val+0xe/0x10
    [12410647.597405] [] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
    [12410647.597407] [] handle_mm_fault+0x251/0x370
    [12410647.597411] [] ? call_rwsem_down_read_failed+0x14/0x30
    [12410647.597414] [] __do_page_fault+0x1aa/0x550
    [12410647.597418] [] ? up_write+0x1d/0x20
    [12410647.597422] [] ? vm_mmap_pgoff+0xbc/0xe0
    [12410647.597425] [] ? SyS_mmap_pgoff+0xd8/0x240
    [12410647.597427] [] do_page_fault+0xe/0x10
    [12410647.597431] [] page_fault+0x28/0x30

    Signed-off-by: Milosz Tanski
    Signed-off-by: David Howells

    Milosz Tanski
     
  • Extend the fscache netfs API so that the netfs can ask as to whether a cache
    object is up to date with respect to its corresponding netfs object:

    int fscache_check_consistency(struct fscache_cookie *cookie)

    This will call back to the netfs to check whether the auxiliary data associated
    with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
    may also return -ENOMEM and -ERESTARTSYS.

    The backends now have to implement a mandatory operation pointer:

    int (*check_consistency)(struct fscache_object *object)

    that corresponds to the above API call. FS-Cache takes care of pinning the
    object and the cookie in memory and managing this call with respect to the
    object state.
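    A sketch of how a netfs might use it (the response shown is one
    possibility, not mandated by the API):

    ret = fscache_check_consistency(cookie);
    if (ret == -ESTALE)
            fscache_invalidate(cookie);     /* cache copy is out of date */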

    Original-author: Hongyi Jia
    Signed-off-by: David Howells
    cc: Hongyi Jia
    cc: Milosz Tanski

    David Howells
     

19 Jun, 2013

5 commits

  • struct fscache_retrieval contains a count of the number of pages that still
    need some processing (n_pages). This is decremented as the pages are
    processed.

    However, this needs to be atomic as fscache_retrieval_complete() (I think) just
    occasionally may be called from cachefiles_read_backing_file() and
    cachefiles_read_copier() simultaneously.

    This happens when an fscache_read_or_alloc_pages() request containing a lot of
    pages (say a couple of hundred) is being processed. The read on each backing
    page is dispatched individually because we need to insert a monitor into the
    waitqueue to catch when the read completes. However, under low-memory
    conditions, we might be forced to wait in the allocator - and this gives the
    I/O on the backing page a chance to complete first.

    When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
    the workqueue without waiting for the operation to finish the initial I/O
    dispatch (we want to release any pages we can as soon as we can), thus both can
    end up running simultaneously and potentially attempting to partially complete
    the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
    the page cache).

    This was demonstrated by parallelling the non-atomic counter with an atomic
    counter and printing both of them when the assertion fails. At this point, the
    atomic counter has reached zero, but the non-atomic counter has not.

    To fix this, make the counter an atomic_t.
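    A sketch of the change (field and helper simplified; the 'true' argument
    is an assumption about fscache_op_complete()):

    struct fscache_retrieval {
            ...
            atomic_t n_pages;               /* was: int n_pages */
    };

    static inline void fscache_retrieval_complete(struct fscache_retrieval *op,
                                                  int n_pages)
    {
            atomic_sub(n_pages, &op->n_pages);
            if (atomic_read(&op->n_pages) <= 0)
                    fscache_op_complete(&op->op, true);
    }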

    This results in the following bug appearing:

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:421!

    or

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:414!

    With a backtrace like the following:

    RIP: 0010:[] fscache_put_operation+0x1ad/0x240 [fscache]
    Call Trace:
    [] fscache_retrieval_work+0x55/0x270 [fscache]
    [] ? fscache_retrieval_work+0x0/0x270 [fscache]
    [] worker_thread+0x170/0x2a0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? worker_thread+0x0/0x2a0
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    Signed-off-by: David Howells
    Reviewed-and-tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Simplify the way fscache cache objects retain their cookie. The way I
    implemented the cookie storage handling made synchronisation a pain (ie. the
    object state machine can't rely on the cookie actually still being there).

    Instead of the object being detached from the cookie and the cookie being
    freed in __fscache_relinquish_cookie(), we defer both operations:

    (*) The detachment of the object from the list in the cookie now takes place
    in fscache_drop_object() and is thus governed by the object state machine
    (fscache_detach_from_cookie() has been removed).

    (*) The release of the cookie is now in fscache_object_destroy() - which is
    called by the cache backend just before it frees the object.

    This means that the fscache_cookie struct is now available to the cache all the
    way through from ->alloc_object() to ->drop_object() and ->put_object() -
    meaning that it's no longer necessary to take object->lock to guarantee access.

    However, __fscache_relinquish_cookie() doesn't wait for the object to go all
    the way through to destruction before letting the netfs proceed. That would
    massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the
    cookie around, it must therefore break all attachments to the netfs - which
    includes ->def, ->netfs_data and any outstanding page read/writes.

    To handle this, struct fscache_cookie now has an n_active counter:

    (1) This starts off initialised to 1.

    (2) Any time the cache needs to get at the netfs data, it calls
    fscache_use_cookie() to increment it - if it is not zero. If it was zero,
    then access is not permitted.

    (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
    to decrement it. This does a wake-up on it if it reaches 0.

    (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
    reach 0. The initialisation to 1 in step (1) ensures that we only get
    wake ups when we're trying to get rid of the cookie.

    This leaves __fscache_relinquish_cookie() a lot simpler.

    ***
    This fixes a problem in the current code whereby if fscache_invalidate() is
    followed sufficiently quickly by fscache_relinquish_cookie() then it is
    possible for __fscache_relinquish_cookie() to have detached the cookie from the
    object and cleared the pointer before a thread is dispatched to process the
    invalidation state in the object state machine.

    Since the pending write clearance was deferred to the invalidation state to
    make it asynchronous, we need to either wait in relinquishment for the stores
    tree to be cleared in the invalidation state or we need to handle the clearance
    in relinquishment.

    Further, if the relinquishment code does clear the tree, then the invalidation
    state needs to make the clearance contingent on still having the cookie to
    hand (since that's where the tree is rooted) and we have to prevent the
    cookie from disappearing for the duration.

    This can lead to an oops like the following:

    BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
    ...
    RIP: 0010:[] _spin_lock+0xe/0x30
    ...
    CR2: 000000000000000c ...
    ...
    Process kslowd002 (...)
    ....
    Call Trace:
    [] fscache_invalidate_writes+0x38/0xd0 [fscache]
    [] ? __switch_to+0xd0/0x320
    [] ? find_busiest_queue+0x69/0x150
    [] ? slow_work_enqueue+0x104/0x180
    [] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
    [] ? bit_waitqueue+0x17/0xd0
    [] slow_work_execute+0x233/0x310
    [] slow_work_thread+0x205/0x360
    [] ? autoremove_wake_function+0x0/0x40
    [] ? slow_work_thread+0x0/0x360
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    The parameter to fscache_invalidate_writes() was object->cookie which is NULL.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Fix object state machine to have separate work and wait states as that makes
    it easier to envision.

    There are now three kinds of state:

    (1) Work state. This is an execution state. No event processing is performed
    by a work state. The function attached to a work state returns a pointer
    indicating the next state to which the OSM should transition. Returning
    NO_TRANSIT repeats the current state, but goes back to the scheduler
    first.

    (2) Wait state. This is an event processing state. No execution is
    performed by a wait state. Wait states are just tables of "if event X
    occurs, clear it and transition to state Y". The dispatcher returns to
    the scheduler if none of the events in which the wait state has an
    interest are currently pending.

    (3) Out-of-band state. This is a special work state. Transitions to normal
    states can be overridden when an unexpected event occurs (eg. I/O error).
    Instead the dispatcher disables and clears the OOB event and transits to
    the specified work state. This then acts as an ordinary work state,
    though object->state points to the overridden destination. Returning
    NO_TRANSIT resumes the overridden transition.

    In addition, the states have names in their definitions, so there's no need for
    tables of state names. Further, the EV_REQUEUE event is no longer necessary as
    that is automatic for work states.

    Since the states are now separate structs rather than values in an enum, it's
    not possible to use comparisons other than (non-)equality between them, so use
    some object->flags to indicate what phase an object is in.
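    Illustratively, a state is now a small struct along these lines (shape
    inferred from the description above, not the exact definition):

    struct fscache_state {
            char name[24];          /* states carry their own names */
            bool work;              /* work state (or OOB) vs wait state */
            const struct fscache_state *(*work_fn)(struct fscache_object *,
                                                   int event);
    };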

    The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
    (EV_KILL). An object flag now carries the information about retirement.

    Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
    into a KILL_OBJECT state, and additional states have been added for handling
    waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).

    A state has also been added for synchronising with parent object initialisation
    (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set. This
    goes some way towards mitigating fscache deadlocking against ext4 by way of
    the allocator, eg:

    INFO: task flush-8:0:24427 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    flush-8:0 D ffff88003e2b9fd8 0 24427 2 0x00000000
    ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
    0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
    0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
    Call Trace:
    [] ? __lock_is_held+0x31/0x53
    [] ? radix_tree_lookup_element+0xf4/0x12a
    [] schedule+0x60/0x62
    [] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
    [] ? __init_waitqueue_head+0x4d/0x4d
    [] __fscache_maybe_release_page+0x30c/0x324 [fscache]
    [] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] nfs_fscache_release_page+0x68/0x94 [nfs]
    [] nfs_release_page+0x7e/0x86 [nfs]
    [] try_to_release_page+0x32/0x3b
    [] shrink_page_list+0x535/0x71a
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] shrink_inactive_list+0x20a/0x2dd
    [] ? mark_held_locks+0xbe/0xea
    [] shrink_lruvec+0x34c/0x3eb
    [] do_try_to_free_pages+0xcf/0x355
    [] try_to_free_pages+0x9a/0xa1
    [] __alloc_pages_nodemask+0x494/0x6f7
    [] kmem_getpages+0x58/0x155
    [] fallback_alloc+0x120/0x1f3
    [] ? trace_hardirqs_off+0xd/0xf
    [] ____cache_alloc_node+0x177/0x186
    [] ? ext4_init_io_end+0x1c/0x37
    [] kmem_cache_alloc+0xf1/0x176
    [] ? test_set_page_writeback+0x101/0x113
    [] ext4_init_io_end+0x1c/0x37
    [] ext4_bio_write_page+0x20f/0x3af
    [] mpage_da_submit_io+0x26e/0x2f6
    [] ? __find_get_block_slow+0x38/0x133
    [] mpage_da_map_and_submit+0x3a7/0x3bd
    [] ext4_da_writepages+0x30d/0x426
    [] do_writepages+0x1c/0x2a
    [] __writeback_single_inode+0x3e/0xe5
    [] writeback_sb_inodes+0x1bd/0x2f4
    [] __writeback_inodes_wb+0x6f/0xb4
    [] wb_writeback+0x101/0x195
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] ? wb_do_writeback+0xaa/0x173
    [] wb_do_writeback+0x4a/0x173
    [] ? trace_hardirqs_on+0xd/0xf
    [] ? del_timer+0x4b/0x5b
    [] bdi_writeback_thread+0x6d/0x147
    [] ? wb_do_writeback+0x173/0x173
    [] kthread+0xd0/0xd8
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] ? __init_kthread_worker+0x55/0x55
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x55/0x55
    2 locks held by flush-8:0/24427:
    #0: (&type->s_umount_key#41){.+.+..}, at: [] grab_super_passive+0x4c/0x76
    #1: (jbd2_handle){+.+...}, at: [] start_this_handle+0x475/0x4ea

    The problem here is that another thread, which is attempting to write the
    to-be-stored NFS page to the on-ext4 cache file is waiting for the journal
    lock, eg:

    INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u:2 D ffff880039589768 0 24437 2 0x00000000
    ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
    0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
    0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
    Call Trace:
    [] ? mark_held_locks+0xbe/0xea
    [] ? _raw_spin_unlock_irqrestore+0x3a/0x50
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] ? trace_hardirqs_on+0xd/0xf
    [] schedule+0x60/0x62
    [] start_this_handle+0x317/0x4ea
    [] ? __init_waitqueue_head+0x4d/0x4d
    [] jbd2__journal_start+0xb3/0x12e
    [] __ext4_journal_start_sb+0xb2/0xc6
    [] ext4_da_write_begin+0x109/0x233
    [] generic_file_buffered_write+0x11a/0x264
    [] ? __mark_inode_dirty+0x2d/0x1ee
    [] __generic_file_aio_write+0x2a5/0x2d5
    [] generic_file_aio_write+0x6f/0xd0
    [] ext4_file_write+0x38c/0x3c4
    [] do_sync_write+0x91/0xd1
    [] cachefiles_write_page+0x26f/0x310 [cachefiles]
    [] fscache_write_op+0x21e/0x37a [fscache]
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] fscache_op_work_func+0x78/0xd7 [fscache]
    [] process_one_work+0x232/0x3a8
    [] ? process_one_work+0x1d7/0x3a8
    [] worker_thread+0x214/0x303
    [] ? manage_workers+0x245/0x245
    [] kthread+0xd0/0xd8
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] ? __init_kthread_worker+0x55/0x55
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x55/0x55
    4 locks held by kworker/u:2/24437:
    #0: (fscache_operation){.+.+.+}, at: [] process_one_work+0x1d7/0x3a8
    #1: ((&op->work)){+.+.+.}, at: [] process_one_work+0x1d7/0x3a8
    #2: (sb_writers#14){.+.+.+}, at: [] generic_file_aio_write+0x51/0xd0
    #3: (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [] generic_file_aio_write+0x5b/0x

    fscache already tries to cancel pending stores, but it can't cancel a write
    for which I/O is already in progress.

    An alternative would be to accept writing garbage to the cache under extreme
    circumstances and to kill the afflicted cache object if we have to do this.
    However, we really need to know how strapped the allocator is before deciding
    to do that.
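    The mitigation itself is small (a sketch; the real function does more):

    /* in __fscache_maybe_release_page(), before sleeping: */
    if (!(gfp & __GFP_FS))
            return false;   /* report the page as busy rather than sleep
                             * while fs locks (eg. the ext4 journal) may
                             * be held further up the call chain */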

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
    The spin_lock() within the condition in while() will cause a compile error
    if it is not a function. This is not a problem on mainline, but it does not
    look pretty and there is no reason to do it that way.
    This patch writes it a little differently and avoids the double condition.
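    The pattern and the rewrite look roughly like this (illustrative names,
    not the actual diff):

    /* before: only compiles if spin_lock() is usable as an expression */
    while (spin_lock(&lock), n = find_work(), n > 0) {
            process();
            spin_unlock(&lock);
    }
    spin_unlock(&lock);     /* drop the lock taken by the final test */

    /* after: plain statements, a single exit test */
    for (;;) {
            spin_lock(&lock);
            n = find_work();
            if (n == 0) {
                    spin_unlock(&lock);
                    break;
            }
            process();
            spin_unlock(&lock);
    }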

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    Sebastian Andrzej Siewior
     

21 Dec, 2012

10 commits

  • Provide fscache_cancel_op() with a pointer to a function it should invoke under
    lock if it cancels an operation.

    Use this to clear the remaining page count upon cancellation of a pending
    retrieval operation so that fscache_release_retrieval_op() doesn't get an
    assertion failure (see below). This can happen when a signal occurs, say from
    CTRL-C being pressed during data retrieval.
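    Sketchily, the cancellation path gains a callback invoked under lock (the
    callback name and body are illustrative):

    static void fscache_do_cancel_retrieval(struct fscache_operation *_op)
    {
            struct fscache_retrieval *op =
                    container_of(_op, struct fscache_retrieval, op);

            op->n_pages = 0;        /* the pages will never be processed */
    }

    /* called if fscache_cancel_op() does cancel the operation */
    fscache_cancel_op(&op->op, fscache_do_cancel_retrieval);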

    FS-Cache: Assertion failed
    3 == 0 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/page.c:237!
    invalid opcode: 0000 [#641] SMP
    Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
    CPU 0
    Pid: 6075, comm: slurp-q Tainted: GF D 3.7.0-rc8-fsdevel+ #411 /DG965RY
    RIP: 0010:[] [] fscache_release_retrieval_op+0x75/0xff [fscache]
    RSP: 0000:ffff88001c6d7988 EFLAGS: 00010296
    RAX: 000000000000000f RBX: ffff880014cdfe00 RCX: ffffffff6c102000
    RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
    RBP: ffff88001c6d7998 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffe00
    R13: ffff88001c6d7ab4 R14: ffff88001a8638a0 R15: ffff88001552b190
    FS: 00007f877aaf0700(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fff11378fd2 CR3: 000000001c6c6000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process slurp-q (pid: 6075, threadinfo ffff88001c6d6000, task ffff88001c6c4080)
    Stack:
    ffffffffa007ec07 ffff880014cdfe00 ffff88001c6d79c8 ffffffffa007db4d
    ffffffffa007ec07 ffff880014cdfe00 00000000fffffe00 ffff88001c6d7ab4
    ffff88001c6d7a38 ffffffffa008116d 0000000000000000 ffff88001c6c4080
    Call Trace:
    [] ? fscache_cancel_op+0x194/0x1cf [fscache]
    [] fscache_put_operation+0x135/0x2ed [fscache]
    [] ? fscache_cancel_op+0x194/0x1cf [fscache]
    [] __fscache_read_or_alloc_pages+0x413/0x4bc [fscache]
    [] ? __alloc_pages_nodemask+0x195/0x75c
    [] __nfs_readpages_from_fscache+0x86/0x13d [nfs]
    [] nfs_readpages+0x186/0x1bd [nfs]
    [] ? alloc_pages_current+0xc7/0xe4
    [] ? __page_cache_alloc+0x84/0x91
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] __do_page_cache_readahead+0x237/0x2e0
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x359/0x382
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x26b/0x637
    [] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4]
    [] nfs_file_read+0xaa/0xcf [nfs]
    [] do_sync_read+0x91/0xd1
    [] vfs_read+0x9b/0x144
    [] sys_read+0x44/0x75
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: David Howells

    David Howells
     
  • Mark as cancelled an operation that is in progress rather than pending at the
    time it is cancelled, and call fscache_complete_op() to cancel an operation so
    that blocked ops can be started.

    Signed-off-by: David Howells

    David Howells
     
    In fscache_write_op(), if the object is determined to have become inactive
    or to have lost its cookie, we don't move the operation state from
    in-progress, and so an assertion in fscache_put_operation() fails (see
    below).

    Instrumenting fscache_op_work_func() indicates that it called
    fscache_write_op() before calling fscache_put_operation() - where the assertion
    failed. The assertion at line 433 indicates that the operation state is
    IN_PROGRESS rather than being COMPLETE or CANCELLED.

    Instrumenting fscache_write_op() showed that it was being called on an object
    that had had its cookie removed and that this was due to relinquishment of the
    cookie by the netfs. At this point fscache no longer has access to the pages
    of netfs data that were requested to be written, and so simply cancelling the
    operation is the thing to do.

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:433!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
    CPU 0
    Pid: 1035, comm: kworker/u:3 Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY
    RIP: 0010:[] [] fscache_put_operation+0x11a/0x2ed [fscache]
    RSP: 0018:ffff88003e32bcf8 EFLAGS: 00010296
    RAX: 000000000000000f RBX: ffff88001818eb78 RCX: ffffffff6c102000
    RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
    RBP: ffff88003e32bd18 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa00811da
    R13: 0000000000000001 R14: 0000000100625d26 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fff7dd31c68 CR3: 000000003d730000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/u:3 (pid: 1035, threadinfo ffff88003e32a000, task ffff88003bb38080)
    Stack:
    ffffffff8102d1ad ffff88001818eb78 ffffffffa00811da 0000000000000001
    ffff88003e32bd48 ffffffffa007f0ad ffff88001818eb78 ffffffff819583c0
    ffff88003df24e00 ffff88003882c3e0 ffff88003e32bde8 ffffffff81042de0
    Call Trace:
    [] ? vprintk_emit+0x3c6/0x41a
    [] ? __fscache_read_or_alloc_pages+0x4bc/0x4bc [fscache]
    [] fscache_op_work_func+0xec/0x123 [fscache]
    [] process_one_work+0x21c/0x3b0
    [] ? process_one_work+0x1be/0x3b0
    [] ? fscache_operation_gc+0x23e/0x23e [fscache]
    [] worker_thread+0x202/0x2df
    [] ? rescuer_thread+0x18e/0x18e
    [] kthread+0xd0/0xd8
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] ? __init_kthread_worker+0x55/0x55
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x55/0x55

    Signed-off-by: David Howells

    David Howells
     
  • wait_on_bit() with TASK_INTERRUPTIBLE returns 1 rather than a negative error
    code, so change what we check for. This means that the signal handling in
    fscache_wait_for_retrieval_activation() should now work properly.
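    The fix is in what the return value is compared against (sketch):

    /* before: never true - a signal yields 1, not a negative error */
    if (wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
                    fscache_wait_bit_interruptible, TASK_INTERRUPTIBLE) < 0)

    /* after */
    if (wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
                    fscache_wait_bit_interruptible, TASK_INTERRUPTIBLE) != 0)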

    Without this, the following bug can be seen if CTRL-C is pressed during
    fscache read operation:

    FS-Cache: Assertion failed
    2 == 3 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/page.c:347!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
    CPU 1
    Pid: 15006, comm: slurp-q Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY
    RIP: 0010:[] [] fscache_wait_for_retrieval_activation+0x167/0x177 [fscache]
    RSP: 0018:ffff88002a4c39a8 EFLAGS: 00010292
    RAX: 000000000000001a RBX: ffff88002d3dc158 RCX: 0000000000008685
    RDX: ffffffff8102ccd6 RSI: 0000000000000001 RDI: ffffffff8102d1d6
    RBP: ffff88002a4c39c8 R08: 0000000000000002 R09: 0000000000000000
    R10: ffffffff8163afa0 R11: ffff88003bd11900 R12: ffffffffa00868c8
    R13: ffff880028306458 R14: ffff88002d3dc1b0 R15: ffff88001372e538
    FS: 00007f17426a0700(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f1742494a44 CR3: 0000000031bd7000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process slurp-q (pid: 15006, threadinfo ffff88002a4c2000, task ffff880023de3040)
    Stack:
    ffff88002d3dc158 ffff88001372e538 ffff88002a4c3ab4 ffff8800283064e0
    ffff88002a4c3a38 ffffffffa0080f6d 0000000000000000 ffff880023de3040
    ffff88002a4c3ac8 ffffffff810ac8ae ffff880028306458 ffff88002a4c3bc8
    Call Trace:
    [] __fscache_read_or_alloc_pages+0x24f/0x4bc [fscache]
    [] ? __alloc_pages_nodemask+0x195/0x75c
    [] __nfs_readpages_from_fscache+0x86/0x13d [nfs]
    [] nfs_readpages+0x186/0x1bd [nfs]
    [] ? alloc_pages_current+0xc7/0xe4
    [] ? __page_cache_alloc+0x84/0x91
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] __do_page_cache_readahead+0x237/0x2e0
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x359/0x382
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x26b/0x637
    [] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4]
    [] nfs_file_read+0xaa/0xcf [nfs]
    [] do_sync_read+0x91/0xd1
    [] vfs_read+0x9b/0x144
    [] sys_read+0x44/0x75
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: David Howells

    David Howells
     
    nfs_migrate_page() does not wait for FS-Cache to finish with a page,
    probably leading to the following bad-page-state error:

    BUG: Bad page state in process python-bin pfn:17d39b
    page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null)
    index:38686 (Tainted: G B ---------------- )
    Pid: 31053, comm: python-bin Tainted: G B ----------------
    2.6.32-71.24.1.el6.x86_64 #1
    Call Trace:
    [] bad_page+0x107/0x160
    [] free_hot_cold_page+0x1c9/0x220
    [] __pagevec_free+0x59/0xb0
    [] ? flush_tlb_others_ipi+0x128/0x130
    [] release_pages+0x21c/0x250
    [] ? remove_migration_pte+0x28a/0x2b0
    [] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70
    [] ____pagevec_lru_add+0x167/0x180
    [] __lru_cache_add+0x58/0x70
    [] lru_cache_add_lru+0x21/0x40
    [] putback_lru_page+0x69/0x100
    [] migrate_pages+0x13d/0x5d0
    [] ? ____pagevec_lru_add+0x167/0x180
    [] ? compaction_alloc+0x0/0x370
    [] compact_zone+0x4cc/0x600
    [] ? get_page_from_freelist+0x15c/0x820
    [] ? check_preempt_wakeup+0x1c4/0x3c0
    [] compact_zone_order+0x7e/0xb0
    [] try_to_compact_pages+0x109/0x170
    [] __alloc_pages_nodemask+0x5ed/0x850
    [] ? thread_return+0x4e/0x778
    [] alloc_pages_vma+0x93/0x150
    [] do_huge_pmd_anonymous_page+0x135/0x340
    [] ? rwsem_down_read_failed+0x26/0x30
    [] handle_mm_fault+0x245/0x2b0
    [] do_page_fault+0x123/0x3a0
    [] page_fault+0x25/0x30

    nfs_migrate_page() calls nfs_fscache_release_page(), which doesn't actually
    wait - even if __GFP_WAIT is set. The reason it doesn't wait is that
    fscache_maybe_release_page() might deadlock the allocator as the work threads
    writing to the cache may all end up sleeping on memory allocation.

    However, I wonder if that is actually a problem. There are a number of things
    I can do to deal with this:

    (1) Make nfs_migrate_page() wait.

    (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.

    (3) Set a timeout around the wait.

    (4) Make nfs_migrate_page() return an error if the page is still busy.

    For the moment, I'll select (2) and (4).
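    A sketch of option (4) (simplified; the real nfs_migrate_page() has other
    checks as well):

    int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
                         struct page *page, enum migrate_mode mode)
    {
            /* with (2) in place, this honours __GFP_WAIT and may wait */
            if (!nfs_fscache_release_page(page, GFP_KERNEL))
                    return -EBUSY;          /* fscache still owns the page */
            return migrate_page(mapping, newpage, page, mode);
    }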

    Signed-off-by: David Howells
    Acked-by: Jeff Layton

    David Howells
     
  • CacheFiles is missing some calls to fscache_retrieval_complete() in the error
    handling/collision paths of its reader functions.

    This can be seen by the following assertion tripping in
    fscache_put_operation(), where the operation being destroyed is still in the
    in-progress state and has not been cancelled or completed:

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:408!
    invalid opcode: 0000 [#1] SMP
    CPU 2
    Modules linked in: xfs ioatdma dca loop joydev evdev
    psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
    pci_hotplug sg sr_mod]

    Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell Inc. PowerEdge 1950/0DT097
    RIP: 0010:[] [] fscache_put_operation+0x304/0x330
    RSP: 0018:ffff880062f739d8 EFLAGS: 00010296
    RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
    RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
    RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
    R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
    FS: 00007f63dcf636e0(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process httpd (pid: 8062, threadinfo ffff880062f72000, task ffff880087e58000)
    Stack:
    ffff880062f73bf8 0000000000000000 ffff880062f73bf8 ffff880037a2cd20
    ffff880062f73a68 ffffffff8119aa7e ffff88006540e000 ffff880062f73ad4
    ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48 ffff8800c5122e40
    Call Trace:
    [] __fscache_read_or_alloc_pages+0x1fe/0x530
    [] __nfs_readpages_from_fscache+0x70/0x1c0
    [] nfs_readpages+0xca/0x1e0
    [] ? rpc_do_put_task+0x36/0x50
    [] ? alloc_nfs_open_context+0x4b/0x110
    [] ? rpc_call_sync+0x5a/0x70
    [] __do_page_cache_readahead+0x1ca/0x270
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x11d/0x250
    [] page_cache_sync_readahead+0x36/0x60
    [] generic_file_aio_read+0x454/0x770
    [] nfs_file_read+0xe1/0x130
    [] do_sync_read+0xd9/0x120
    [] ? mntput+0x1f/0x40
    [] ? fput+0x1cb/0x260
    [] vfs_read+0xc8/0x180
    [] sys_read+0x55/0x90
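
    A hedged sketch of the shape of the fix (cachefiles_read_backing_page() is a
    hypothetical stand-in for the real reader functions): every path that
    finishes with a page it was given, including error and collision paths, must
    report it, or the op never leaves the in-progress state:

    static int cachefiles_read_one_page(struct fscache_retrieval *op,
                                        struct page *netfs_page)
    {
            int ret;

            ret = cachefiles_read_backing_page(op, netfs_page); /* hypothetical */
            if (ret < 0)
                    fscache_retrieval_complete(op, 1); /* the missing call */
            return ret;
    }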

    Reported-by: Mark Moseley
    Signed-off-by: David Howells

    David Howells
     
  • Provide a proper invalidation method rather than relying on the netfs
    retiring the cookie it has and getting a new one. The problem with that
    approach is that it isn't easy for the netfs to make sure it has completed or
    cancelled all its outstanding storage and retrieval operations on the cookie
    it is retiring.

    Instead, have the cache provide an invalidation method that will cancel or wait
    for all currently outstanding operations before invalidating the cache, and
    will cause new operations to queue up behind that. Whilst invalidation is in
    progress, some requests will be rejected until the cache can stack a barrier on
    the operation queue to cause new operations to be deferred behind it.
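
    A hedged sketch of how a netfs might use such a method, assuming an
    fscache_invalidate()/fscache_wait_on_invalidate() pair along the lines the
    description implies (netfs_inode and its fscache field are illustrative
    names):

    static void netfs_data_changed(struct netfs_inode *ni)
    {
            /* Queued as an exclusive operation: it runs only once all
             * outstanding reads/writes on the cookie have completed or been
             * cancelled, and later operations queue up behind it. */
            fscache_invalidate(ni->fscache);

            /* Optionally block until the cache object has been wiped. */
            fscache_wait_on_invalidate(ni->fscache);
    }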

    Signed-off-by: David Howells

    David Howells
     
  • Fix the state management of internal fscache operations and the accounting of
    what operations are in what states.

    This is done by:

    (1) Give struct fscache_operation an enum variable that directly represents
    the state it's currently in, rather than spreading this knowledge across a
    bunch of flags recording who's processing the operation at the moment and
    whether it is queued.

    This makes it easier to write assertions to check the state at various
    points and to prevent invalid state transitions.

    (2) Add an 'operation complete' state and supply a function to indicate the
    completion of an operation (fscache_op_complete()) and make things call it.
    The final call to fscache_put_operation() can then check that the op is in
    the appropriate state (complete or cancelled).

    (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
    govern the state of an object:

    (a) The ->n_ops is now the number of extant operations on the object
    and is now decremented by fscache_put_operation() only.

    (b) The ->n_in_progress is simply the number of operations that have been
    taken off of the object's pending queue for the purposes of being run.
    This is decremented by fscache_op_complete() only.

    (c) The ->n_exclusive is the number of exclusive ops that have been
    submitted and queued or are in progress. It is decremented by
    fscache_op_complete() and by fscache_cancel_op().

    fscache_put_operation() and fscache_operation_gc() now no longer try to
    clean up ->n_exclusive and ->n_in_progress. That was leading to double
    decrements against fscache_cancel_op().

    fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
    double decrements against fscache_put_operation().

    fscache_submit_exclusive_op() now decides whether it has to queue an op
    based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
    will persist in being true even after all preceding operations have been
    cancelled or completed. Furthermore, if an object is active and there are
    runnable ops against it, there must be at least one op running.

    (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
    provide a function to record completion of the pages as they complete (see
    the sketch after this list).

    When n_pages reaches 0, the operation is deemed to be complete and
    fscache_op_complete() is called.

    Add calls to fscache_retrieval_complete() anywhere we've finished with a
    page we've been given to read or allocate for. This includes places where
    we just return pages to the netfs for reading from the server and where
    accessing the cache fails and we discard the proposed netfs page.
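
    A hedged, illustrative sketch of points (1) and (4); names and fields
    approximate the description above rather than copying the kernel source:

    enum fscache_operation_state {
            FSCACHE_OP_ST_PENDING,          /* queued, waiting to be run */
            FSCACHE_OP_ST_IN_PROGRESS,      /* taken off the pending queue */
            FSCACHE_OP_ST_COMPLETE,         /* fscache_op_complete() called */
            FSCACHE_OP_ST_CANCELLED,        /* fscache_cancel_op() called */
    };

    static inline void fscache_retrieval_complete(struct fscache_retrieval *op,
                                                  int n_pages)
    {
            /* Each page handed back to the netfs (or discarded on error) is
             * accounted here; when the last one is recorded, the retrieval
             * as a whole moves to the complete state. */
            if (atomic_sub_return(n_pages, &op->n_pages) <= 0)
                    fscache_op_complete(&op->op);
    }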

    The bugs in the unfixed state management manifest themselves as oopses like
    the following, where operation completion gets out of sync with the netfs's
    return of the cookie. This is possible because the cache unlocks and returns
    all the netfs pages before recording its completion - which means that
    there's nothing to stop the netfs discarding them and returning the cookie.

    FS-Cache: Cookie 'NFS.fh' still has outstanding reads
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/cookie.c:519!
    invalid opcode: 0000 [#1] SMP
    CPU 1
    Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

    Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
    RIP: 0010:[] [] __fscache_relinquish_cookie+0x170/0x343 [fscache]
    RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
    RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
    RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
    RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
    R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
    R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
    FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
    Stack:
    ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
    ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
    ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
    Call Trace:
    [] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
    [] nfs_clear_inode+0x3c/0x41 [nfs]
    [] nfs4_evict_inode+0x2f/0x33 [nfs]
    [] evict+0xa1/0x15c
    [] dispose_list+0x2c/0x38
    [] prune_icache_sb+0x28c/0x29b
    [] prune_super+0xd5/0x140
    [] shrink_slab+0x102/0x1ab
    [] balance_pgdat+0x2f2/0x595
    [] ? process_timeout+0xb/0xb
    [] kswapd+0x270/0x289
    [] ? __init_waitqueue_head+0x46/0x46
    [] ? balance_pgdat+0x595/0x595
    [] kthread+0x7f/0x87
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x45/0xc0
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x53/0x53
    [] ? gs_change+0xb/0xb

    Signed-off-by: David Howells

    David Howells
     
  • Downgrade the requirements passed to the allocator in the gfp flags parameter.
    FS-Cache/CacheFiles can handle OOM conditions simply by aborting the attempt to
    store an object or a page in the cache.
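
    As a hedged illustration, the downgrade amounts to allocating with a mask
    along these lines instead of plain GFP_KERNEL (the exact flag choice here is
    an assumption): the cache may sleep, but never retries hard or dips into
    emergency reserves, because a failed cache store is harmless:

    /* May wait, but gives up early - losing a cache store is acceptable. */
    #define cachefiles_gfp (__GFP_WAIT | __GFP_NORETRY | __GFP_NOMEMALLOC)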

    Signed-off-by: David Howells

    David Howells
     
  • Under some circumstances CacheFiles defers the marking of pages with
    PG_fscache so that it can take advantage of pagevecs, reducing the number of
    calls to fscache_mark_pages_cached() and to the netfs's hook that keeps
    track of this.

    There are, however, two problems with this:

    (1) It can lead to the PG_fscache mark being applied _after_ the page is set
    PG_uptodate and unlocked (by the call to fscache_end_io()).

    (2) CacheFiles's ref on the page is dropped immediately following
    fscache_end_io() - and so may not still be held when the mark is applied.
    This can lead to the page being passed back to the allocator before the
    mark is applied.

    Fix this by, where appropriate, marking the page before calling
    fscache_end_io() and releasing the page (see the sketch below). This means
    that we can't take advantage of pagevecs and have to make a separate call to
    the marking routines for each page.

    The symptoms of this are Bad Page state errors cropping up under memory
    pressure, for example:

    BUG: Bad page state in process tar pfn:002da
    page:ffffea0000009fb0 count:0 mapcount:0 mapping: (null) index:0x1447
    page flags: 0x1000(private_2)
    Pid: 4574, comm: tar Tainted: G W 3.1.0-rc4-fsdevel+ #1064
    Call Trace:
    [] ? dump_page+0xb9/0xbe
    [] bad_page+0xd5/0xea
    [] get_page_from_freelist+0x35b/0x46a
    [] __alloc_pages_nodemask+0x362/0x662
    [] __do_page_cache_readahead+0x13a/0x267
    [] ? __do_page_cache_readahead+0xa2/0x267
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x28b/0x29a
    [] ? ondemand_readahead+0x163/0x29a
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x2ab/0x67e
    [] nfs_file_read+0xa4/0xc9 [nfs]
    [] do_sync_read+0xba/0xfa
    [] ? security_file_permission+0x7b/0x84
    [] ? rw_verify_area+0xab/0xc8
    [] vfs_read+0xaa/0x13a
    [] sys_read+0x45/0x6c
    [] system_call_fastpath+0x16/0x1b

    As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.

    Instrumenting fscache_mark_pages_cached() to verify whether page->mapping
    was set appropriately showed that sometimes it wasn't. This led to the
    discovery that sometimes the page had apparently been reclaimed by the time
    the marker got to see it.
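
    A hedged sketch of the fixed ordering (cachefiles_read_done() is a
    hypothetical wrapper; fscache_mark_page_cached() is the per-page variant of
    the marking routine): mark the page while our reference is still held, then
    complete the I/O, then drop the reference:

    static void cachefiles_read_done(struct fscache_retrieval *op,
                                     struct page *page, int error)
    {
            if (error == 0)
                    fscache_mark_page_cached(op, page); /* sets PG_fscache */
            fscache_end_io(op, page, error);    /* may unlock the page */
            put_page(page);                     /* drop our ref only now */
    }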

    Reported-by: M. Stevens
    Signed-off-by: David Howells
    Reviewed-by: Jeff Layton

    David Howells
     


08 Jul, 2011

1 commit

  • Add an FS-Cache helper to bulk uncache pages on an inode. This will
    only work for the circumstance where the pages in the cache correspond
    1:1 with the pages attached to an inode's page cache.

    This is required for CIFS and NFS: when disabling an inode cookie, we were
    returning the cookie and setting cifsi->fscache to NULL but failed to
    invalidate any previously mapped pages. This resulted in "Bad page state"
    errors and manifested as other kinds of errors when running fsstress. Fix it
    by uncaching mapped pages when we disable the inode cookie.
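
    A hedged sketch of the caller's side using the new helper (the cifsInodeInfo
    field layout here is illustrative):

    static void cifs_fscache_disable_inode_cookie(struct inode *inode)
    {
            struct cifsInodeInfo *cifsi = CIFS_I(inode);

            if (cifsi->fscache) {
                    /* Strip PG_fscache from every page still attached to the
                     * inode's page cache before giving the cookie back. */
                    fscache_uncache_all_inode_pages(cifsi->fscache, inode);
                    fscache_relinquish_cookie(cifsi->fscache, 1);
                    cifsi->fscache = NULL;
            }
    }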

    This patch should fix the following oops and "Bad page state" errors
    seen during fsstress testing.

    ------------[ cut here ]------------
    kernel BUG at fs/cachefiles/namei.c:201!
    invalid opcode: 0000 [#1] SMP
    Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
    RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
    RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
    RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
    RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
    R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
    R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
    FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
    Stack:
    0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
    ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
    ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
    Call Trace:
    cachefiles_lookup_object+0x78/0xd4 [cachefiles]
    fscache_lookup_object+0x131/0x16d [fscache]
    fscache_object_work_func+0x1bc/0x669 [fscache]
    process_one_work+0x186/0x298
    worker_thread+0xda/0x15d
    kthread+0x84/0x8c
    kernel_thread_helper+0x4/0x10
    RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
    ---[ end trace 1d481c9af1804caa ]---

    I tested the uncaching by the following means:

    (1) Create a big file on my NFS server (104857600 bytes).

    (2) Read the file into the cache with md5sum on the NFS client. Look in
    /proc/fs/fscache/stats:

    Pages : mrk=25601 unc=0

    (3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc
    again:

    Pages : mrk=25601 unc=25601

    Reported-by: Jeff Layton
    Signed-off-by: David Howells
    Reviewed-and-Tested-by: Suresh Jayaraman
    cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     


23 Jul, 2010

1 commit

  • Make fscache operations use only a workqueue instead of a combination of
    workqueue and slow-work. FSCACHE_OP_SLOW is dropped and FSCACHE_OP_FAST is
    renamed to FSCACHE_OP_ASYNC, which uses the newly added fscache_op_wq
    workqueue to execute op->processor(). fscache_operation_init_slow() is
    dropped and fscache_operation_init() now takes a @processor argument
    directly.

    * An unbound workqueue is used.

    * fscache_retrieval_work() is no longer necessary, as OP_ASYNC now does the
    equivalent thing.

    * A sysctl, fscache.operation_max_active, is added to control concurrency.
    The default value is nr_cpus clamped between 2 and WQ_UNBOUND_MAX_ACTIVE.

    * debugfs support is dropped for now; a tracing-API-based debug facility is
    planned to be added.
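
    A hedged sketch of the resulting dispatch, using the flag and workqueue
    names from the description above (simplified, not the verbatim kernel code):

    struct workqueue_struct *fscache_op_wq;    /* allocated WQ_UNBOUND */

    void fscache_enqueue_operation(struct fscache_operation *op)
    {
            switch (op->flags & FSCACHE_OP_TYPE) {
            case FSCACHE_OP_ASYNC:
                    /* Formerly FSCACHE_OP_FAST: run op->processor() from
                     * the unbound fscache_op_wq. */
                    queue_work(fscache_op_wq, &op->work);
                    break;
            case FSCACHE_OP_MYTHREAD:
                    /* Left for the thread that submitted it to process. */
                    break;
            }
    }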

    Signed-off-by: Tejun Heo
    Acked-by: David Howells

    Tejun Heo