18 Feb, 2014

1 commit

  • When FS-Cache allocates an object, the following sequence of events can
    occur:

    -->fscache_alloc_object()
    -->cachefiles_alloc_object() [via cache->ops->alloc_object]
    fscache_attach_object()
    cachefiles_put_object() [via cache->ops->put_object]
    -->fscache_object_destroy()
    -->fscache_objlist_remove()
    -->rb_erase() to remove the object from fscache_object_list.

    resulting in a crash in the rbtree code.

    The problem is that the object is only added to fscache_object_list on
    the success path of fscache_attach_object() where it calls
    fscache_objlist_add().

    So if fscache_attach_object() fails, the object won't have been added to
    the objlist rbtree. We do, however, unconditionally try to remove the
    object from the tree.
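
    A minimal userspace sketch of the failure mode and of one way to guard
    against it. It assumes a plain circular list in place of the rbtree, and
    every name in it (model_object, model_objlist_add() and so on) is
    hypothetical rather than taken from FS-Cache. The idea shown is to mark an
    object as "never added" at init time and to make removal check that marker
    before touching the list:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an object that may or may not be on a global list. */
struct model_object {
    struct model_object *prev, *next;
    bool on_list;                 /* false until model_objlist_add() has run */
};

/* Circular list head; the kernel code uses an rbtree here instead. */
static struct model_object model_head = { &model_head, &model_head, true };
static int model_list_len;

static void model_object_init(struct model_object *obj)
{
    obj->prev = obj->next = NULL;
    obj->on_list = false;         /* mark as "never added" up front */
}

static void model_objlist_add(struct model_object *obj)
{
    obj->next = model_head.next;
    obj->prev = &model_head;
    model_head.next->prev = obj;
    model_head.next = obj;
    obj->on_list = true;
    model_list_len++;
}

static void model_objlist_remove(struct model_object *obj)
{
    if (!obj->on_list)            /* attach failed, so there is nothing to erase */
        return;
    obj->prev->next = obj->next;
    obj->next->prev = obj->prev;
    model_object_init(obj);       /* back to the "never added" state */
    model_list_len--;
}
```

    With the guard in place, removing an object that never reached the add
    step is a no-op instead of a walk through uninitialised links.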

    Thanks to NeilBrown for finding this and suggesting this solution.

    Reported-by: NeilBrown
    Signed-off-by: David Howells
    Tested-by: (a customer of) NeilBrown
    Signed-off-by: Linus Torvalds

    David Howells
     

14 Nov, 2013

1 commit

  • Pull block IO core updates from Jens Axboe:
    "This is the pull request for the core changes in the block layer for
    3.13. It contains:

    - The new blk-mq request interface.

    This is a new and more scalable queueing model that marries the
    best part of the request based interface we currently have (which
    is fully featured, but scales poorly) and the bio based "interface"
    which the new drivers for high IOPS devices end up using because
    it's much faster than the request based one.

    The bio interface has no block layer support, since it taps into
    the stack much earlier. This means that drivers end up having to
    implement a lot of functionality on their own, like tagging,
    timeout handling, requeue, etc. The blk-mq interface provides all
    these. Some drivers even provide a switch to select bio or rq and
    have code to handle both, since things like merging only work in
    the rq model, which is hence faster for some workloads. This is a
    huge mess. Conversion of these drivers nets us a substantial code
    reduction. Initial results on converting SCSI to this model even
    show an 8x improvement on single queue devices. So while the
    model was intended to work on the newer multiqueue devices, it has
    substantial improvements for "classic" hardware as well. This code
    has gone through extensive testing and development, and it's now
    ready to go. A pull request to convert virtio-blk to this model
    will be coming as well, with more drivers scheduled for 3.14
    conversion.

    - Two blktrace fixes from Jan and Chen Gang.

    - A plug merge fix from Alireza Haghdoost.

    - Conversion of __get_cpu_var() from Christoph Lameter.

    - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

    - A fix for a race between request completion and the timeout
    handling from Jeff Moyer. This is what caused the merge conflict
    with blk-mq/core, in case you are looking at that.

    - A dm stacking fix from Mike Snitzer.

    - A code consolidation fix and duplicated code removal from Kent
    Overstreet.

    - A handful of block bug fixes from Mikulas Patocka, fixing a loop
    crash and memory corruption on blk cg.

    - Elevator switch bug fix from Tomoki Sekiyama.

    A heads-up that I had to rebase this branch. Initially the immutable
    bio_vecs had been queued up for inclusion, but a week later, it became
    clear that it wasn't fully cooked yet. So the decision was made to
    pull this out and postpone it until 3.14. It was a straightforward
    rebase, just pruning out the immutable series and the later fixes of
    problems with it. The rest of the patches applied directly and no
    further changes were made"

    * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: Do not call sector_div() with a 64-bit divisor
    kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
    block: Consolidate duplicated bio_trim() implementations
    block: Use rw_copy_check_uvector()
    block: Enable sysfs nomerge control for I/O requests in the plug list
    block: properly stack underlying max_segment_size to DM device
    elevator: acquire q->sysfs_lock in elevator_change()
    elevator: Fix a race in elevator switching and md device initialization
    block: Replace __get_cpu_var uses
    bdi: test bdi_init failure
    block: fix a probe argument to blk_register_region
    loop: fix crash if blk_alloc_queue fails
    blk-core: Fix memory corruption if blkcg_init_queue fails
    block: fix race between request completion and timeout handling
    blktrace: Send BLK_TN_PROCESS events to all running traces
    blk-mq: don't disallow request merges for req->special being set
    blk-mq: mq plug list breakage
    blk-mq: fix for flush deadlock
    ...

    Linus Torvalds
     

08 Nov, 2013

1 commit

  • __get_cpu_var() is used for multiple purposes in the kernel source. One of
    them is address calculation via the form &__get_cpu_var(x). This calculates
    the address for the instance of the percpu variable of the current processor
    based on an offset.

    Other use cases are for storing and retrieving data from the current
    processor's percpu area. __get_cpu_var() can be used as an lvalue when
    writing data or on the right side of an assignment.

    __get_cpu_var() is defined as:

    #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

    __get_cpu_var() always only does an address determination. However, store
    and retrieve operations could use a segment prefix (or global register on
    other platforms) to avoid the address calculation.

    this_cpu_write() and this_cpu_read() can directly take an offset into a
    percpu area and use optimized assembly code to read and write per cpu
    variables.

    This patch converts __get_cpu_var into either an explicit address
    calculation using this_cpu_ptr() or into a use of this_cpu operations that
    use the offset. Thereby address calculations are avoided and fewer registers
    are used when code is generated.

    At the end of the patch set all uses of __get_cpu_var have been removed so
    the macro is removed too.

    The patch set includes passes over all arches as well. Once these operations
    are used throughout, specialized macros can be defined in non-x86 arches
    as well in order to optimize per cpu access by, e.g., using a global
    register that may be set to the per cpu base.

    Transformations done to __get_cpu_var()

    1. Determine the address of the percpu instance of the current processor.

    DEFINE_PER_CPU(int, y);
    int *x = &__get_cpu_var(y);

    Converts to

    int *x = this_cpu_ptr(&y);

    2. Same as #1 but this time an array structure is involved.

    DEFINE_PER_CPU(int, y[20]);
    int *x = __get_cpu_var(y);

    Converts to

    int *x = this_cpu_ptr(y);

    3. Retrieve the content of the current processor's instance of a per cpu
    variable.

    DEFINE_PER_CPU(int, y);
    int x = __get_cpu_var(y);

    Converts to

    int x = __this_cpu_read(y);

    4. Retrieve the content of a percpu struct

    DEFINE_PER_CPU(struct mystruct, y);
    struct mystruct x = __get_cpu_var(y);

    Converts to

    memcpy(&x, this_cpu_ptr(&y), sizeof(x));

    5. Assignment to a per cpu variable

    DEFINE_PER_CPU(int, y);
    __get_cpu_var(y) = x;

    Converts to

    this_cpu_write(y, x);

    6. Increment/Decrement etc of a per cpu variable

    DEFINE_PER_CPU(int, y);
    __get_cpu_var(y)++;

    Converts to

    this_cpu_inc(y);
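
    The difference between the two styles can be pictured with a small
    userspace model. This is only a sketch: the per-CPU area is assumed to be
    a plain array indexed by a pretend CPU number, and all model_* names are
    hypothetical. It shows that the pointer form and the direct read, write
    and increment forms touch the same slot; it does not capture the segment
    prefix or global register optimisations described above.

```c
#define MODEL_NR_CPUS 4

/* Hypothetical: the CPU this code pretends to be running on. */
static int model_current_cpu;

/* One slot per CPU, standing in for DEFINE_PER_CPU(int, y). */
static int model_percpu_y[MODEL_NR_CPUS];

/* Explicit address calculation, in the spirit of this_cpu_ptr(&y). */
static int *model_this_cpu_ptr_y(void)
{
    return &model_percpu_y[model_current_cpu];
}

/* Direct operations, in the spirit of this_cpu_read/write/inc: the caller
 * never sees an address, only the variable's name. */
static int model_this_cpu_read_y(void)
{
    return model_percpu_y[model_current_cpu];
}

static void model_this_cpu_write_y(int value)
{
    model_percpu_y[model_current_cpu] = value;
}

static void model_this_cpu_inc_y(void)
{
    model_percpu_y[model_current_cpu]++;
}
```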

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jens Axboe

    Christoph Lameter
     

28 Sep, 2013

2 commits

  • Provide the ability to enable and disable fscache cookies. A disabled cookie
    will reject or ignore further requests to:

    Acquire a child cookie
    Invalidate and update backing objects
    Check the consistency of a backing object
    Allocate storage for backing page
    Read backing pages
    Write to backing pages

    but still allows:

    Checks/waits on the completion of already in-progress objects
    Uncaching of pages
    Relinquishment of cookies

    Two new operations are provided:

    (1) Disable a cookie:

    void fscache_disable_cookie(struct fscache_cookie *cookie,
                                bool invalidate);

    If the cookie is not already disabled, this locks the cookie against other
    dis/enablement ops, marks the cookie as being disabled, discards or
    invalidates any backing objects and waits for cessation of activity on any
    associated object.

    This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
    but it reinitialises the cookie such that it can be reenabled.

    All possible failures are handled internally. The caller should consider
    calling fscache_uncache_all_inode_pages() afterwards to make sure all page
    markings are cleared up.

    (2) Enable a cookie:

    void fscache_enable_cookie(struct fscache_cookie *cookie,
                               bool (*can_enable)(void *data),
                               void *data);

    If the cookie is not already enabled, this locks the cookie against other
    dis/enablement ops, invokes can_enable() and, if the cookie is not an
    index cookie, will begin the procedure of acquiring backing objects.

    The optional can_enable() function is passed the data argument and returns
    a ruling as to whether or not enablement should actually be permitted to
    begin.

    All possible failures are handled internally. The cookie will only be
    marked as enabled if provisional backing objects are allocated.
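
    The enablement flow described above can be sketched in userspace roughly
    as follows. This is a model under stated assumptions, not the FS-Cache
    implementation: locking is omitted, the cookie is reduced to a few flags,
    acquisition of backing objects is a counter, and model_no_writers() is an
    illustrative policy invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_cookie {
    bool enabled;
    bool is_index;        /* index cookies have no backing objects to acquire */
    bool acquire_ok;      /* assumption: whether provisional allocation succeeds */
    int acquire_calls;    /* how often acquisition was started */
};

static void model_enable_cookie(struct model_cookie *cookie,
                                bool (*can_enable)(void *data), void *data)
{
    if (cookie->enabled)
        return;                         /* already enabled: nothing to do */
    if (can_enable && !can_enable(data))
        return;                         /* optional callback vetoed enablement */
    if (!cookie->is_index) {
        cookie->acquire_calls++;        /* begin acquiring backing objects */
        if (!cookie->acquire_ok)
            return;                     /* failure handled internally */
    }
    cookie->enabled = true;
}

/* Illustrative policy only: refuse while a hypothetical writer count is non-zero. */
static bool model_no_writers(void *data)
{
    const int *writers = data;
    return *writers == 0;
}
```

    Passing NULL for can_enable in this model means enablement is always
    permitted, which matches the callback being described as optional.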

    A later patch will introduce these to NFS. Cookie enablement during nfs_open()
    is then contingent on i_writecount.

    David Howells
     
  • Add wrapper functions for dealing with cookie->n_active:

    (*) __fscache_use_cookie() to increment it.

    (*) __fscache_unuse_cookie() to decrement and test against zero.

    (*) __fscache_wake_unused_cookie() to wake up anyone waiting for it to reach
    zero.

    The second and third are split so that the third can be done after cookie->lock
    has been released in case the waiter wakes up whilst we're still holding it and
    tries to get it.

    We will need to wake-on-zero once the cookie disablement patch is applied
    because it will then be possible to see n_active become zero without the cookie
    being relinquished.

    Also move the cookie usement out of fscache_attr_changed_op() and into
    fscache_attr_changed() and the operation struct so that cookie disablement
    will be able to track it.

    Whilst we're at it, only increment n_active if we're about to do
    fscache_submit_op() so that we don't have to deal with undoing it if anything
    earlier fails. Possibly this should be moved into fscache_submit_op() which
    could look at FSCACHE_OP_UNUSE_COOKIE.

    Signed-off-by: David Howells

    David Howells
     

20 Sep, 2013

1 commit

  • Pull ceph fixes from Sage Weil:
    "These fix several bugs with RBD from 3.11 that didn't get tested in
    time for the merge window: some error handling, a use-after-free, and
    a sequencing issue when unmapping an image races with a notify
    operation.

    There is also a patch fixing a problem with the new ceph + fscache
    code that just went in"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    fscache: check consistency does not decrement refcount
    rbd: fix error handling from rbd_snap_name()
    rbd: ignore unmapped snapshots that no longer exist
    rbd: fix use-after free of rbd_dev->disk
    rbd: make rbd_obj_notify_ack() synchronous
    rbd: complete notifies before cleaning up osd_client and rbd_dev
    libceph: add function to ensure notifies are complete

    Linus Torvalds
     

12 Sep, 2013

1 commit

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];
    <interrupt>
    ...
    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    The in_interrupt() check is somewhat ugly, but we cannot simply key off
    the passed gfp_mask as that is acquired from root_gfp_mask() and is thus
    the same for all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.
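
    A userspace sketch of the two rules described above, assuming a tiny
    preload pool and a boolean in place of in_interrupt(). The names, the
    MODEL_GFP_WAIT flag and the pool size are all hypothetical; only the shape
    of the logic is meant to carry over.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_GFP_WAIT    0x1u   /* hypothetical flag: the caller may sleep */
#define MODEL_PRELOAD_MAX 4

struct model_node { int payload; };

struct model_preload {
    int nr;
    struct model_node *nodes[MODEL_PRELOAD_MAX];
};

static struct model_preload model_pool;
static bool model_in_interrupt;          /* stands in for in_interrupt() */

/* Fill the pool so that a following insert cannot fail for lack of nodes. */
static int model_preload(void)
{
    while (model_pool.nr < MODEL_PRELOAD_MAX) {
        struct model_node *n = malloc(sizeof(*n));
        if (!n)
            return -1;
        model_pool.nodes[model_pool.nr++] = n;
    }
    return 0;
}

/* Only preallocate when the mask allows waiting; otherwise do nothing and
 * leave the caller to cope with a possible insert failure. */
static int model_maybe_preload(unsigned int gfp)
{
    if (gfp & MODEL_GFP_WAIT)
        return model_preload();
    return 0;
}

/* In interrupt context, never touch the pool: always allocate afresh. */
static struct model_node *model_node_alloc(void)
{
    if (!model_in_interrupt && model_pool.nr > 0) {
        struct model_node *n = model_pool.nodes[--model_pool.nr];
        model_pool.nodes[model_pool.nr] = NULL;
        return n;
    }
    return malloc(sizeof(struct model_node));
}
```

    Because the interrupt path bypasses the pool entirely, an interrupted
    process-context user can no longer have the same node handed out twice.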

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

11 Sep, 2013

1 commit

  • __fscache_check_consistency() does not decrement the count of active
    operations after it finishes in the success case. This leads to hung tasks
    on cookie de-registration (commonly in inode eviction).

    INFO: task kworker/1:2:4214 blocked for more than 120 seconds.
    kworker/1:2 D ffff880443513fc0 0 4214 2 0x00000000
    Workqueue: ceph-msgr con_work [libceph]
    ...
    Call Trace:
    [] ? _raw_spin_unlock_irqrestore+0x16/0x20
    [] ? fscache_wait_bit_interruptible+0x30/0x30 [fscache]
    [] schedule+0x29/0x70
    [] fscache_wait_atomic_t+0xe/0x20 [fscache]
    [] out_of_line_wait_on_atomic_t+0x9f/0xe0
    [] ? autoremove_wake_function+0x40/0x40
    [] __fscache_relinquish_cookie+0x15c/0x310 [fscache]
    [] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
    [] ceph_destroy_inode+0x33/0x200 [ceph]
    [] ? __fsnotify_inode_delete+0xe/0x10
    [] destroy_inode+0x3c/0x70
    [] evict+0x119/0x1b0
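
    The class of bug is easy to picture with a small model. The sketch below
    is hypothetical (the struct, the function and its two boolean parameters
    are invented for illustration) and shows only the pattern of routing the
    success path and the error paths through one exit that drops the active
    count:

```c
#include <errno.h>
#include <stdbool.h>

struct model_cookie {
    int n_active;    /* count of operations currently using the cookie */
};

static int model_check_consistency(struct model_cookie *cookie,
                                   bool submit_fails, bool stale)
{
    int ret;

    cookie->n_active++;              /* the operation becomes active */

    if (submit_fails) {
        ret = -ENOMEM;
        goto out;
    }
    ret = stale ? -ESTALE : 0;       /* 0 is the success case */

out:
    cookie->n_active--;              /* dropped on success and on error alike */
    return ret;
}
```

    If the decrement sat only on the error path, each successful check would
    leave the count one too high, and anything waiting for it to reach zero
    would wait forever.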

    Signed-off-by: Milosz Tanski
    Acked-by: David Howells
    Signed-off-by: Sage Weil

    Milosz Tanski
     

06 Sep, 2013

2 commits

  • Currently the fscache code expects the netfs to call
    fscache_read_or_alloc_pages() inside the aops readpages callback. It marks
    all the pages in the list
    provided by readahead with PG_private_2. In the cases that the netfs fails to
    read all the pages (which is legal) it ends up returning to the readahead and
    triggering a BUG. This happens because the page list still contains marked
    pages.

    This patch implements a simple fscache_readpages_cancel function that the netfs
    should call before returning from readpages. It will revoke the pages from the
    underlying cache backend and unmark them.

    The problem was originally worked out in the Ceph devel tree, but it also
    occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages()
    will clean up the unprocessed pages in the case of an error.
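
    A userspace sketch of the situation, assuming each page is reduced to two
    flags. All names are hypothetical; "marked" stands in for PG_private_2 and
    "in_list" for membership of the readahead list.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_page {
    bool marked;     /* stands in for PG_private_2 */
    bool in_list;    /* still on the readahead list, i.e. not yet processed */
};

/* The cache marks every page on the list up front. */
static void model_mark_all(struct model_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pages[i].marked = true;
        pages[i].in_list = true;
    }
}

/* A partial read: only the first `limit` pages leave the list. */
static void model_read_some(struct model_page *pages, size_t n, size_t limit)
{
    for (size_t i = 0; i < n && i < limit; i++)
        pages[i].in_list = false;
}

/* What a cancel step has to do: unmark whatever is still on the list. */
static void model_readpages_cancel(struct model_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pages[i].in_list)
            pages[i].marked = false;
}

/* Pages readahead would free while still marked; each one is a "bad page". */
static size_t model_count_bad(const struct model_page *pages, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (pages[i].in_list && pages[i].marked)
            bad++;
    return bad;
}
```

    In this model, reading two of four pages leaves two bad pages until the
    cancel step runs, after which none remain.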

    This can be used to address the following oops:

    [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e
    [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0
    [12410647.597298] page flags: 0x200000000001000(private_2)

    ...

    [12410647.597334] Call Trace:
    [12410647.597345] [] dump_stack+0x19/0x1b
    [12410647.597356] [] bad_page+0xc7/0x120
    [12410647.597359] [] free_pages_prepare+0x10e/0x120
    [12410647.597361] [] free_hot_cold_page+0x40/0x170
    [12410647.597363] [] __put_single_page+0x27/0x30
    [12410647.597365] [] put_page+0x25/0x40
    [12410647.597376] [] ceph_readpages+0x2e9/0x6e0 [ceph]
    [12410647.597379] [] __do_page_cache_readahead+0x1af/0x260
    [12410647.597382] [] ra_submit+0x21/0x30
    [12410647.597384] [] filemap_fault+0x254/0x490
    [12410647.597387] [] __do_fault+0x6f/0x4e0
    [12410647.597391] [] ? __switch_to+0x16d/0x4a0
    [12410647.597395] [] ? finish_task_switch+0x5a/0xc0
    [12410647.597398] [] handle_pte_fault+0xf6/0x930
    [12410647.597401] [] ? pte_mfn_to_pfn+0x93/0x110
    [12410647.597403] [] ? xen_pmd_val+0xe/0x10
    [12410647.597405] [] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
    [12410647.597407] [] handle_mm_fault+0x251/0x370
    [12410647.597411] [] ? call_rwsem_down_read_failed+0x14/0x30
    [12410647.597414] [] __do_page_fault+0x1aa/0x550
    [12410647.597418] [] ? up_write+0x1d/0x20
    [12410647.597422] [] ? vm_mmap_pgoff+0xbc/0xe0
    [12410647.597425] [] ? SyS_mmap_pgoff+0xd8/0x240
    [12410647.597427] [] do_page_fault+0xe/0x10
    [12410647.597431] [] page_fault+0x28/0x30

    Signed-off-by: Milosz Tanski
    Signed-off-by: David Howells

    Milosz Tanski
     
  • Extend the fscache netfs API so that the netfs can ask as to whether a cache
    object is up to date with respect to its corresponding netfs object:

    int fscache_check_consistency(struct fscache_cookie *cookie)

    This will call back to the netfs to check whether the auxiliary data associated
    with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
    may also return -ENOMEM and -ERESTARTSYS.

    The backends now have to implement a mandatory operation pointer:

    int (*check_consistency)(struct fscache_object *object)

    that corresponds to the above API call. FS-Cache takes care of pinning the
    object and the cookie in memory and managing this call with respect to the
    object state.

    Original-author: Hongyi Jia
    Signed-off-by: David Howells
    cc: Hongyi Jia
    cc: Milosz Tanski

    David Howells
     

19 Jun, 2013

8 commits

  • Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the
    code would normally be in a locked section where it should return 1. This
    means it cannot be used for an assertion that checks that a spinlock is locked.

    Remove such usages from FS-Cache.
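
    Why such an assertion misfires can be shown with a small model. The
    sketch assumes, as an illustration only, a build switch MODEL_SMP that is
    left undefined to mimic a configuration where the lock-state query is
    compiled down to a constant; all names are hypothetical.

```c
#include <stdbool.h>

struct model_spinlock {
    bool held;
};

/* Hypothetical build switch: define MODEL_SMP to get a real state query. */
#ifdef MODEL_SMP
static bool model_spin_is_locked(struct model_spinlock *lock)
{
    return lock->held;
}
#else
/* Mimics the hardwired case: the answer is 0 whatever the lock state. */
static bool model_spin_is_locked(struct model_spinlock *lock)
{
    (void)lock;
    return false;
}
#endif

static void model_spin_lock(struct model_spinlock *lock)
{
    lock->held = true;
}

static void model_spin_unlock(struct model_spinlock *lock)
{
    lock->held = false;
}
```

    In the hardwired build, an assertion of the form
    ASSERT(model_spin_is_locked(&lock)) fails even directly after
    model_spin_lock(&lock). In kernel code, lockdep_assert_held() is the
    facility normally used for this kind of check.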

    The following oops might otherwise be observed:

    FS-Cache: Assertion failed
    BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()!
    Kernel panic - not syncing: BUG!
    CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2
    Workqueue: fscache_operation fscache_op_work_func [fscache]
    7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90
    60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0
    00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0
    Call Trace:
    7f091c88: [] dump_stack+0x17/0x19
    7f091c98: [] panic+0xf4/0x1e9
    7f091d38: [] set_signals+0x1e/0x40
    7f091d58: [] __wake_up+0x4e/0x70
    7f091d98: [] fscache_start_operations+0x43/0x50 [fscache]
    7f091da8: [] fscache_op_complete+0x1d3/0x220 [fscache]
    7f091db8: [] unlock_page+0x55/0x60
    7f091de8: [] cachefiles_read_copier+0x250/0x330 [cachefiles]
    7f091e58: [] fscache_op_work_func+0xac/0x120 [fscache]
    7f091e88: [] process_one_work+0x250/0x3a0
    7f091ef8: [] worker_thread+0x177/0x2a0
    7f091f38: [] worker_thread+0x0/0x2a0
    7f091f58: [] kthread+0xd8/0xe0
    7f091f68: [] finish_task_switch.isra.64+0x37/0xa0
    7f091fd8: [] new_thread_handler+0x8f/0xb0

    Reported-by: Milosz Tanski
    Signed-off-by: David Howells
    Reviewed-and-tested-By: Milosz Tanski

    David Howells
     
  • struct fscache_retrieval contains a count of the number of pages that still
    need some processing (n_pages). This is decremented as the pages are
    processed.

    However, this needs to be atomic as (I think) fscache_retrieval_complete() may
    occasionally be called from cachefiles_read_backing_file() and
    cachefiles_read_copier() simultaneously.

    This happens when an fscache_read_or_alloc_pages() request containing a lot of
    pages (say a couple of hundred) is being processed. The read on each backing
    page is dispatched individually because we need to insert a monitor into the
    waitqueue to catch when the read completes. However, under low-memory
    conditions, we might be forced to wait in the allocator - and this gives the
    I/O on the backing page a chance to complete first.

    When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
    the workqueue without waiting for the operation to finish the initial I/O
    dispatch (we want to release any pages we can as soon as we can), thus both can
    end up running simultaneously and potentially attempting to partially complete
    the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
    the page cache).

    This was demonstrated by parallelling the non-atomic counter with an atomic
    counter and printing both of them when the assertion fails. At this point, the
    atomic counter has reached zero, but the non-atomic counter has not.

    To fix this, make the counter an atomic_t.
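
    The effect of the change can be sketched with C11 atomics in userspace.
    The struct and function names are hypothetical and the kernel uses its own
    atomic_t API rather than <stdatomic.h>; the point is only that an atomic
    subtract hands each caller a consistent "before" value, so exactly one
    caller observes the count reaching zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_retrieval {
    atomic_int n_pages;    /* pages that still need processing */
};

static void model_retrieval_init(struct model_retrieval *op, int n_pages)
{
    atomic_init(&op->n_pages, n_pages);
}

/* Account for n_done processed pages. Returns true only for the caller
 * whose subtraction brings the count to zero. */
static bool model_retrieval_complete(struct model_retrieval *op, int n_done)
{
    int before = atomic_fetch_sub(&op->n_pages, n_done);
    return before - n_done == 0;
}
```

    With a plain int, two simultaneous callers could both read the same old
    value and one update would be lost, which is the kind of mismatch the
    assertion reports.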

    Without the fix, the race results in the following bug appearing:

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:421!

    or

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:414!

    With a backtrace like the following:

    RIP: 0010:[] fscache_put_operation+0x1ad/0x240 [fscache]
    Call Trace:
    [] fscache_retrieval_work+0x55/0x270 [fscache]
    [] ? fscache_retrieval_work+0x0/0x270 [fscache]
    [] worker_thread+0x170/0x2a0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? worker_thread+0x0/0x2a0
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    Signed-off-by: David Howells
    Reviewed-and-tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Simplify the way fscache cache objects retain their cookie. The way I
    implemented the cookie storage handling made synchronisation a pain (ie. the
    object state machine can't rely on the cookie actually still being there).

    Instead of the object being detached from the cookie and the cookie being
    freed in __fscache_relinquish_cookie(), we defer both operations:

    (*) The detachment of the object from the list in the cookie now takes place
    in fscache_drop_object() and is thus governed by the object state machine
    (fscache_detach_from_cookie() has been removed).

    (*) The release of the cookie is now in fscache_object_destroy() - which is
    called by the cache backend just before it frees the object.

    This means that the fscache_cookie struct is now available to the cache all the
    way through from ->alloc_object() to ->drop_object() and ->put_object() -
    meaning that it's no longer necessary to take object->lock to guarantee access.

    However, __fscache_relinquish_cookie() doesn't wait for the object to go all
    the way through to destruction before letting the netfs proceed. That would
    massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the
    cookie around, it must therefore break all attachments to the netfs - which
    includes ->def, ->netfs_data and any outstanding page read/writes.

    To handle this, struct fscache_cookie now has an n_active counter:

    (1) This starts off initialised to 1.

    (2) Any time the cache needs to get at the netfs data, it calls
    fscache_use_cookie() to increment it - if it is not zero. If it was zero,
    then access is not permitted.

    (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
    to decrement it. This does a wake-up on it if it reaches 0.

    (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
    reach 0. The initialisation to 1 in step (1) ensures that we only get
    wake ups when we're trying to get rid of the cookie.
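
    The four steps above can be sketched in userspace with C11 atomics. This
    is a model under assumptions: names are hypothetical, the wake-up is
    reduced to setting a flag, and relinquishment reports whether the count
    is already zero where real code would sleep until it is.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_cookie {
    atomic_int n_active;
    bool woken;              /* stands in for waking the waiter */
};

/* Step 1: the counter starts at 1. */
static void model_cookie_init(struct model_cookie *cookie)
{
    atomic_init(&cookie->n_active, 1);
    cookie->woken = false;
}

/* Step 2: increment only if the counter is not zero. */
static bool model_use_cookie(struct model_cookie *cookie)
{
    int old = atomic_load(&cookie->n_active);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&cookie->n_active, &old, old + 1))
            return true;
    }
    return false;            /* access is not permitted */
}

/* Step 3: decrement, and wake the waiter on reaching zero. */
static void model_unuse_cookie(struct model_cookie *cookie)
{
    if (atomic_fetch_sub(&cookie->n_active, 1) == 1)
        cookie->woken = true;
}

/* Step 4: drop the initial count; returns true if no users remain. */
static bool model_relinquish_cookie(struct model_cookie *cookie)
{
    model_unuse_cookie(cookie);
    return atomic_load(&cookie->n_active) == 0;
}
```

    Because the count starts at 1, ordinary use and unuse pairs never reach
    zero on their own, so the wake-up can only fire once relinquishment has
    dropped the initial count.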

    This leaves __fscache_relinquish_cookie() a lot simpler.

    ***
    This fixes a problem in the current code whereby if fscache_invalidate() is
    followed sufficiently quickly by fscache_relinquish_cookie() then it is
    possible for __fscache_relinquish_cookie() to have detached the cookie from the
    object and cleared the pointer before a thread is dispatched to process the
    invalidation state in the object state machine.

    Since the pending write clearance was deferred to the invalidation state to
    make it asynchronous, we need to either wait in relinquishment for the stores
    tree to be cleared in the invalidation state or we need to handle the clearance
    in relinquishment.

    Further, if the relinquishment code does clear the tree, then the invalidation
    state needs to make the clearance contingent on still having the cookie to hand
    (since that's where the tree is rooted) and we have to prevent the cookie from
    disappearing for the duration.

    This can lead to an oops like the following:

    BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
    ...
    RIP: 0010:[] _spin_lock+0xe/0x30
    ...
    CR2: 000000000000000c ...
    ...
    Process kslowd002 (...)
    ....
    Call Trace:
    [] fscache_invalidate_writes+0x38/0xd0 [fscache]
    [] ? __switch_to+0xd0/0x320
    [] ? find_busiest_queue+0x69/0x150
    [] ? slow_work_enqueue+0x104/0x180
    [] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
    [] ? bit_waitqueue+0x17/0xd0
    [] slow_work_execute+0x233/0x310
    [] slow_work_thread+0x205/0x360
    [] ? autoremove_wake_function+0x0/0x40
    [] ? slow_work_thread+0x0/0x360
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    The parameter to fscache_invalidate_writes() was object->cookie which is NULL.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Fix object state machine to have separate work and wait states as that makes
    it easier to envision.

    There are now three kinds of state:

    (1) Work state. This is an execution state. No event processing is performed
    by a work state. The function attached to a work state returns a pointer
    indicating the next state to which the OSM should transition. Returning
    NO_TRANSIT repeats the current state, but goes back to the scheduler
    first.

    (2) Wait state. This is an event processing state. No execution is
    performed by a wait state. Wait states are just tables of "if event X
    occurs, clear it and transition to state Y". The dispatcher returns to
    the scheduler if none of the events in which the wait state has an
    interest are currently pending.

    (3) Out-of-band state. This is a special work state. Transitions to normal
    states can be overridden when an unexpected event occurs (eg. I/O error).
    Instead the dispatcher disables and clears the OOB event and transits to
    the specified work state. This then acts as an ordinary work state,
    though object->state points to the overridden destination. Returning
    NO_TRANSIT resumes the overridden transition.
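
    A compact userspace sketch of the work/wait split, assuming states are
    const structs that carry either a work function or a transition table.
    The state names, the single MODEL_EV_KILL event and the dispatcher are
    hypothetical and much simpler than the real object state machine;
    out-of-band states are left out.

```c
#include <stddef.h>

#define MODEL_EV_KILL 0x1u
#define MODEL_NO_TRANSIT ((const struct model_state *)NULL)

struct model_object;
struct model_state;

/* Wait-state table entry: "if event X occurs, clear it and go to state Y". */
struct model_transition {
    unsigned int events;
    const struct model_state *to;
};

struct model_state {
    const char *name;
    /* Non-NULL for a work state: runs and returns the next state. */
    const struct model_state *(*work)(struct model_object *obj);
    /* Non-NULL for a wait state: table terminated by events == 0. */
    const struct model_transition *transitions;
};

struct model_object {
    const struct model_state *state;
    unsigned int events;     /* pending events */
    int work_done;           /* how many work functions have run */
};

static const struct model_state model_wait_active, model_kill, model_dead;

static const struct model_state *model_init_work(struct model_object *obj)
{
    obj->work_done++;
    return &model_wait_active;
}

static const struct model_state *model_kill_work(struct model_object *obj)
{
    obj->work_done++;
    return &model_dead;
}

static const struct model_transition model_active_table[] = {
    { MODEL_EV_KILL, &model_kill },
    { 0, NULL }
};
static const struct model_transition model_dead_table[] = { { 0, NULL } };

static const struct model_state model_init = { "INIT", model_init_work, NULL };
static const struct model_state model_wait_active = { "WAIT_ACTIVE", NULL, model_active_table };
static const struct model_state model_kill = { "KILL", model_kill_work, NULL };
static const struct model_state model_dead = { "DEAD", NULL, model_dead_table };

/* Run the object until it has to go back to the scheduler. */
static void model_dispatch(struct model_object *obj)
{
    for (;;) {
        const struct model_state *s = obj->state;
        const struct model_transition *t;

        if (s->work) {
            const struct model_state *next = s->work(obj);
            if (next == MODEL_NO_TRANSIT)
                return;                      /* same state again, later */
            obj->state = next;
            continue;
        }
        for (t = s->transitions; t->events; t++) {
            if (obj->events & t->events) {
                obj->events &= ~t->events;   /* clear the event */
                obj->state = t->to;
                break;
            }
        }
        if (!t->events)
            return;                          /* nothing of interest pending */
    }
}
```

    Since each state is a distinct struct, the only meaningful comparison
    between states is pointer equality, and each state carries its own name.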

    In addition, the states have names in their definitions, so there's no need for
    tables of state names. Further, the EV_REQUEUE event is no longer necessary as
    that is automatic for work states.

    Since the states are now separate structs rather than values in an enum, it's
    not possible to use comparisons other than (non-)equality between them, so use
    some object->flags to indicate what phase an object is in.

    The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
    (EV_KILL). An object flag now carries the information about retirement.

    Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
    into a KILL_OBJECT state and additional states have been added for handling
    waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).

    A state has also been added for synchronising with parent object initialisation
    (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Wrap checks on object state (mostly outside of fs/fscache/object.c) with
    inline functions so that the mechanism can be replaced.

    Some of the state checks within object.c are left as-is as they will be
    replaced.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Uninline fscache_object_init() so as not to expose some of the FS-Cache
    internals to the cache backend.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set. This
    goes some way towards mitigating fscache deadlocking against ext4 by way of
    the allocator, eg:

    INFO: task flush-8:0:24427 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    flush-8:0 D ffff88003e2b9fd8 0 24427 2 0x00000000
    ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
    0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
    0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
    Call Trace:
    [] ? __lock_is_held+0x31/0x53
    [] ? radix_tree_lookup_element+0xf4/0x12a
    [] schedule+0x60/0x62
    [] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
    [] ? __init_waitqueue_head+0x4d/0x4d
    [] __fscache_maybe_release_page+0x30c/0x324 [fscache]
    [] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] nfs_fscache_release_page+0x68/0x94 [nfs]
    [] nfs_release_page+0x7e/0x86 [nfs]
    [] try_to_release_page+0x32/0x3b
    [] shrink_page_list+0x535/0x71a
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] shrink_inactive_list+0x20a/0x2dd
    [] ? mark_held_locks+0xbe/0xea
    [] shrink_lruvec+0x34c/0x3eb
    [] do_try_to_free_pages+0xcf/0x355
    [] try_to_free_pages+0x9a/0xa1
    [] __alloc_pages_nodemask+0x494/0x6f7
    [] kmem_getpages+0x58/0x155
    [] fallback_alloc+0x120/0x1f3
    [] ? trace_hardirqs_off+0xd/0xf
    [] ____cache_alloc_node+0x177/0x186
    [] ? ext4_init_io_end+0x1c/0x37
    [] kmem_cache_alloc+0xf1/0x176
    [] ? test_set_page_writeback+0x101/0x113
    [] ext4_init_io_end+0x1c/0x37
    [] ext4_bio_write_page+0x20f/0x3af
    [] mpage_da_submit_io+0x26e/0x2f6
    [] ? __find_get_block_slow+0x38/0x133
    [] mpage_da_map_and_submit+0x3a7/0x3bd
    [] ext4_da_writepages+0x30d/0x426
    [] do_writepages+0x1c/0x2a
    [] __writeback_single_inode+0x3e/0xe5
    [] writeback_sb_inodes+0x1bd/0x2f4
    [] __writeback_inodes_wb+0x6f/0xb4
    [] wb_writeback+0x101/0x195
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] ? wb_do_writeback+0xaa/0x173
    [] wb_do_writeback+0x4a/0x173
    [] ? trace_hardirqs_on+0xd/0xf
    [] ? del_timer+0x4b/0x5b
    [] bdi_writeback_thread+0x6d/0x147
    [] ? wb_do_writeback+0x173/0x173
    [] kthread+0xd0/0xd8
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] ? __init_kthread_worker+0x55/0x55
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x55/0x55
    2 locks held by flush-8:0/24427:
    #0: (&type->s_umount_key#41){.+.+..}, at: [] grab_super_passive+0x4c/0x76
    #1: (jbd2_handle){+.+...}, at: [] start_this_handle+0x475/0x4ea

    The problem here is that another thread, which is attempting to write the
    to-be-stored NFS page to the on-ext4 cache file, is waiting for the journal
    lock, e.g.:

    INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u:2 D ffff880039589768 0 24437 2 0x00000000
    ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
    0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
    0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
    Call Trace:
    [] ? mark_held_locks+0xbe/0xea
    [] ? _raw_spin_unlock_irqrestore+0x3a/0x50
    [] ? trace_hardirqs_on_caller+0x114/0x170
    [] ? trace_hardirqs_on+0xd/0xf
    [] schedule+0x60/0x62
    [] start_this_handle+0x317/0x4ea
    [] ? __init_waitqueue_head+0x4d/0x4d
    [] jbd2__journal_start+0xb3/0x12e
    [] __ext4_journal_start_sb+0xb2/0xc6
    [] ext4_da_write_begin+0x109/0x233
    [] generic_file_buffered_write+0x11a/0x264
    [] ? __mark_inode_dirty+0x2d/0x1ee
    [] __generic_file_aio_write+0x2a5/0x2d5
    [] generic_file_aio_write+0x6f/0xd0
    [] ext4_file_write+0x38c/0x3c4
    [] do_sync_write+0x91/0xd1
    [] cachefiles_write_page+0x26f/0x310 [cachefiles]
    [] fscache_write_op+0x21e/0x37a [fscache]
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] fscache_op_work_func+0x78/0xd7 [fscache]
    [] process_one_work+0x232/0x3a8
    [] ? process_one_work+0x1d7/0x3a8
    [] worker_thread+0x214/0x303
    [] ? manage_workers+0x245/0x245
    [] kthread+0xd0/0xd8
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] ? __init_kthread_worker+0x55/0x55
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x55/0x55
    4 locks held by kworker/u:2/24437:
    #0: (fscache_operation){.+.+.+}, at: [] process_one_work+0x1d7/0x3a8
    #1: ((&op->work)){+.+.+.}, at: [] process_one_work+0x1d7/0x3a8
    #2: (sb_writers#14){.+.+.+}, at: [] generic_file_aio_write+0x51/0xd0
    #3: (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [] generic_file_aio_write+0x5b/0x

    fscache already tries to cancel pending stores, but it can't cancel a write
    for which I/O is already in progress.

    An alternative would be to accept writing garbage to the cache under extreme
    circumstances and to kill the afflicted cache object if we have to do this.
    However, we really need to know how strapped the allocator is before deciding
    to do that.

    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    David Howells
     
  • The spinlock() within the condition in while() will cause a compile error
    if it is not a function. This is not a problem on mainline, but it does not
    look pretty and there is no reason to do it that way.
    This patch writes it a little differently and avoids the double condition.
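
    A minimal userspace sketch of the kind of rewrite described above. The
    names fake_lock(), fake_unlock() and pending are hypothetical stand-ins and
    are not taken from the fscache code; the sketch only shows the lock being
    taken in the loop body instead of in the while() condition.

```c
#include <assert.h>

/* Hypothetical stand-ins for a spinlock and a lock-protected work counter. */
static int lock_depth;
static int pending = 3;

static void fake_lock(void)   { lock_depth++; }
static void fake_unlock(void) { lock_depth--; }

/*
 * Drain the pending work.  The lock is taken inside the loop body rather
 * than in the loop condition via the comma operator, so this would still
 * compile if fake_lock() were a statement-like macro.
 */
static int drain_pending(void)
{
	int handled = 0;

	for (;;) {
		fake_lock();
		assert(lock_depth == 1);
		if (pending == 0) {
			fake_unlock();
			break;
		}
		pending--;
		fake_unlock();
		handled++;
	}
	return handled;
}
```

    Calling drain_pending() once handles every pending item and leaves the
    model lock released.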

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: David Howells
    Tested-By: Milosz Tanski
    Acked-by: Jeff Layton

    Sebastian Andrzej Siewior
     

30 Apr, 2013

1 commit

  • There is a kernel memory leak observed when the proc file
    /proc/fs/fscache/stats is read.

    The reason is that fscache_stats_open() calls single_open(), but the
    matching release function is not called on release. Fix this by using
    the correct release function, single_release().
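
    The leak pattern can be modelled in userspace as follows. This is only a
    sketch: model_open(), model_release() and struct model_file are
    hypothetical and merely imitate an open helper that allocates per-open
    state and a release helper that must free it.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocations;	/* buffers allocated but not yet freed */

struct model_file { void *private_data; };

/* Models an open helper that allocates per-open state. */
static int model_open(struct model_file *f)
{
	f->private_data = malloc(64);
	if (!f->private_data)
		return -1;
	live_allocations++;
	return 0;
}

/* Models the matching release helper that frees that state. */
static int model_release(struct model_file *f)
{
	assert(f->private_data != NULL);
	free(f->private_data);
	f->private_data = NULL;
	live_allocations--;
	return 0;
}

/* Open and close the file n times; returns the number of leaked buffers. */
static int open_close_cycles(int n)
{
	struct model_file f = { 0 };

	for (int i = 0; i < n; i++) {
		if (model_open(&f) == 0)
			model_release(&f);
	}
	return live_allocations;
}
```

    If model_release() were left out of the loop, every cycle would add one
    leaked buffer, which is the behaviour the commit describes.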

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=57101

    Signed-off-by: Anurup m
    Cc: shyju pv
    Cc: Sanil kumar
    Cc: Nataraj m
    Cc: Li Zefan
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anurup m
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
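
    A simplified userspace model of a three-argument hlist iterator, for
    illustration. struct hnode, struct hhead, entry_or_null() and
    model_hlist_for_each_entry() are hypothetical names and are not the
    definitions in linux/list.h; the sketch assumes a compiler that provides
    __typeof__ (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of an hlist; not the kernel's definitions. */
struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

struct item {
	int value;
	struct hnode link;
};

/* Map a node pointer back to its containing entry, or NULL at list end. */
#define entry_or_null(ptr, type, member) \
	((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Three-argument form: the entry pointer is the only cursor needed. */
#define model_hlist_for_each_entry(pos, head, member) \
	for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member); \
	     pos; \
	     pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

static void push_front(struct hhead *h, struct item *it)
{
	assert(h && it);
	it->link.next = h->first;
	h->first = &it->link;
}

static int sum_items(struct hhead *h)
{
	struct item *it;
	int sum = 0;

	model_hlist_for_each_entry(it, h, link)
		sum += it->value;
	return sum;
}
```

    The caller declares only the entry pointer, so the loop reads the same way
    as a list_for_each_entry() loop would.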

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

21 Dec, 2012

17 commits

  • Provide fscache_cancel_op() with a pointer to a function it should invoke under
    lock if it cancels an operation.

    Use this to clear the remaining page count upon cancellation of a pending
    retrieval operation so that fscache_release_retrieval_op() doesn't get an
    assertion failure (see below). This can happen when a signal occurs, say from
    CTRL-C being pressed during data retrieval.
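
    A minimal sketch of a cancel routine that takes a callback and invokes it
    under the lock. Everything here (struct model_op, model_cancel_op(),
    clear_remaining_pages(), the lock_held flag standing in for a real lock) is
    a hypothetical userspace model, not the fscache implementation.

```c
#include <assert.h>
#include <stddef.h>

enum op_state { OP_PENDING, OP_IN_PROGRESS, OP_CANCELLED, OP_COMPLETE };

struct model_op {
	enum op_state state;
	int n_pages;	/* pages still to be processed */
	int lock_held;	/* stands in for the object lock */
};

/* Cancellation hook: forget the pages that will now never be read. */
static void clear_remaining_pages(struct model_op *op)
{
	assert(op->lock_held);	/* must run with the lock held */
	op->n_pages = 0;
}

/*
 * Cancel a pending op, invoking do_cancel (if any) under the lock.
 * Returns 0 on cancellation, -1 if the op was not pending.
 */
static int model_cancel_op(struct model_op *op,
			   void (*do_cancel)(struct model_op *))
{
	int ret = -1;

	op->lock_held = 1;
	if (op->state == OP_PENDING) {
		op->state = OP_CANCELLED;
		if (do_cancel)
			do_cancel(op);
		ret = 0;
	}
	op->lock_held = 0;
	return ret;
}

/* Release-time check that no pages remain outstanding. */
static int release_check_ok(const struct model_op *op)
{
	return op->n_pages == 0;
}
```

    With the hook supplied, a cancelled op reaches release with a page count
    of zero; without it, the count would still be non-zero at that point.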

    FS-Cache: Assertion failed
    3 == 0 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/page.c:237!
    invalid opcode: 0000 [#641] SMP
    Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
    CPU 0
    Pid: 6075, comm: slurp-q Tainted: GF D 3.7.0-rc8-fsdevel+ #411 /DG965RY
    RIP: 0010:[] [] fscache_release_retrieval_op+0x75/0xff [fscache]
    RSP: 0000:ffff88001c6d7988 EFLAGS: 00010296
    RAX: 000000000000000f RBX: ffff880014cdfe00 RCX: ffffffff6c102000
    RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
    RBP: ffff88001c6d7998 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffe00
    R13: ffff88001c6d7ab4 R14: ffff88001a8638a0 R15: ffff88001552b190
    FS: 00007f877aaf0700(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fff11378fd2 CR3: 000000001c6c6000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process slurp-q (pid: 6075, threadinfo ffff88001c6d6000, task ffff88001c6c4080)
    Stack:
    ffffffffa007ec07 ffff880014cdfe00 ffff88001c6d79c8 ffffffffa007db4d
    ffffffffa007ec07 ffff880014cdfe00 00000000fffffe00 ffff88001c6d7ab4
    ffff88001c6d7a38 ffffffffa008116d 0000000000000000 ffff88001c6c4080
    Call Trace:
    [] ? fscache_cancel_op+0x194/0x1cf [fscache]
    [] fscache_put_operation+0x135/0x2ed [fscache]
    [] ? fscache_cancel_op+0x194/0x1cf [fscache]
    [] __fscache_read_or_alloc_pages+0x413/0x4bc [fscache]
    [] ? __alloc_pages_nodemask+0x195/0x75c
    [] __nfs_readpages_from_fscache+0x86/0x13d [nfs]
    [] nfs_readpages+0x186/0x1bd [nfs]
    [] ? alloc_pages_current+0xc7/0xe4
    [] ? __page_cache_alloc+0x84/0x91
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] __do_page_cache_readahead+0x237/0x2e0
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x359/0x382
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x26b/0x637
    [] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4]
    [] nfs_file_read+0xaa/0xcf [nfs]
    [] do_sync_read+0x91/0xd1
    [] vfs_read+0x9b/0x144
    [] sys_read+0x44/0x75
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: David Howells

    David Howells
     
  • Mark as cancelled an operation that is in progress rather than pending at the
    time it is cancelled, and call fscache_complete_op() to cancel an operation so
    that blocked ops can be started.

    Signed-off-by: David Howells

    David Howells
     
  • In fscache_write_op(), if the object is determined to have become inactive or
    to have lost its cookie, we don't move the operation state from in-progress,
    and so an assertion in fscache_put_operation() fails (see below).

    Instrumenting fscache_op_work_func() indicates that it called
    fscache_write_op() before calling fscache_put_operation() - where the assertion
    failed. The assertion at line 433 indicates that the operation state is
    IN_PROGRESS rather than being COMPLETE or CANCELLED.

    Instrumenting fscache_write_op() showed that it was being called on an object
    that had had its cookie removed and that this was due to relinquishment of the
    cookie by the netfs. At this point fscache no longer has access to the pages
    of netfs data that were requested to be written, and so simply cancelling the
    operation is the thing to do.

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:433!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
    CPU 0
    Pid: 1035, comm: kworker/u:3 Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY
    RIP: 0010:[] [] fscache_put_operation+0x11a/0x2ed [fscache]
    RSP: 0018:ffff88003e32bcf8 EFLAGS: 00010296
    RAX: 000000000000000f RBX: ffff88001818eb78 RCX: ffffffff6c102000
    RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
    RBP: ffff88003e32bd18 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa00811da
    R13: 0000000000000001 R14: 0000000100625d26 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fff7dd31c68 CR3: 000000003d730000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/u:3 (pid: 1035, threadinfo ffff88003e32a000, task ffff88003bb38080)
    Stack:
    ffffffff8102d1ad ffff88001818eb78 ffffffffa00811da 0000000000000001
    ffff88003e32bd48 ffffffffa007f0ad ffff88001818eb78 ffffffff819583c0
    ffff88003df24e00 ffff88003882c3e0 ffff88003e32bde8 ffffffff81042de0
    Call Trace:
    [] ? vprintk_emit+0x3c6/0x41a
    [] ? __fscache_read_or_alloc_pages+0x4bc/0x4bc [fscache]
    [] fscache_op_work_func+0xec/0x123 [fscache]
    [] process_one_work+0x21c/0x3b0
    [] ? process_one_work+0x1be/0x3b0
    [] ? fscache_operation_gc+0x23e/0x23e [fscache]
    [] worker_thread+0x202/0x2df
    [] ? rescuer_thread+0x18e/0x18e
    [] kthread+0xd0/0xd8
    [] ? _raw_spin_unlock_irq+0x29/0x3e
    [] ? __init_kthread_worker+0x55/0x55
    [] ret_from_fork+0x7c/0xb0
    [] ? __init_kthread_worker+0x55/0x55

    Signed-off-by: David Howells

    David Howells
     
  • wait_on_bit() with TASK_INTERRUPTIBLE returns 1 rather than a negative error
    code, so change what we check for. This means that the signal handling in
    fscache_wait_for_retrieval_activation() should now work properly.
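
    The difference between the two checks can be sketched as below.
    model_wait_on_bit() is a hypothetical stand-in that returns 1 on
    interruption, as the text above describes, and MODEL_ERESTARTSYS is a
    locally defined placeholder for the error the caller would return.

```c
#include <assert.h>

#define MODEL_ERESTARTSYS 512	/* placeholder error value for this sketch */

/*
 * Hypothetical model of an interruptible wait: returns 0 when the bit
 * cleared normally and 1 (not a negative errno) when a signal arrived.
 */
static int model_wait_on_bit(int signal_pending)
{
	return signal_pending ? 1 : 0;
}

/* Old-style check: only negative values count as interruption. */
static int wait_checked_negative(int signal_pending)
{
	int ret = model_wait_on_bit(signal_pending);

	return ret < 0 ? -MODEL_ERESTARTSYS : 0;
}

/* New-style check: any non-zero value counts as interruption. */
static int wait_checked_nonzero(int signal_pending)
{
	int ret = model_wait_on_bit(signal_pending);

	assert(ret == 0 || ret == 1);
	return ret != 0 ? -MODEL_ERESTARTSYS : 0;
}
```

    With a pending signal, the first variant reports success and the caller
    carries on as though the operation had been activated; the second variant
    reports the interruption.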

    Without this, the following bug can be seen if CTRL-C is pressed during
    an fscache read operation:

    FS-Cache: Assertion failed
    2 == 3 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/page.c:347!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
    CPU 1
    Pid: 15006, comm: slurp-q Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY
    RIP: 0010:[] [] fscache_wait_for_retrieval_activation+0x167/0x177 [fscache]
    RSP: 0018:ffff88002a4c39a8 EFLAGS: 00010292
    RAX: 000000000000001a RBX: ffff88002d3dc158 RCX: 0000000000008685
    RDX: ffffffff8102ccd6 RSI: 0000000000000001 RDI: ffffffff8102d1d6
    RBP: ffff88002a4c39c8 R08: 0000000000000002 R09: 0000000000000000
    R10: ffffffff8163afa0 R11: ffff88003bd11900 R12: ffffffffa00868c8
    R13: ffff880028306458 R14: ffff88002d3dc1b0 R15: ffff88001372e538
    FS: 00007f17426a0700(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f1742494a44 CR3: 0000000031bd7000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process slurp-q (pid: 15006, threadinfo ffff88002a4c2000, task ffff880023de3040)
    Stack:
    ffff88002d3dc158 ffff88001372e538 ffff88002a4c3ab4 ffff8800283064e0
    ffff88002a4c3a38 ffffffffa0080f6d 0000000000000000 ffff880023de3040
    ffff88002a4c3ac8 ffffffff810ac8ae ffff880028306458 ffff88002a4c3bc8
    Call Trace:
    [] __fscache_read_or_alloc_pages+0x24f/0x4bc [fscache]
    [] ? __alloc_pages_nodemask+0x195/0x75c
    [] __nfs_readpages_from_fscache+0x86/0x13d [nfs]
    [] nfs_readpages+0x186/0x1bd [nfs]
    [] ? alloc_pages_current+0xc7/0xe4
    [] ? __page_cache_alloc+0x84/0x91
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] __do_page_cache_readahead+0x237/0x2e0
    [] ? __do_page_cache_readahead+0xa6/0x2e0
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x359/0x382
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x26b/0x637
    [] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4]
    [] nfs_file_read+0xaa/0xcf [nfs]
    [] do_sync_read+0x91/0xd1
    [] vfs_read+0x9b/0x144
    [] sys_read+0x44/0x75
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: David Howells

    David Howells
     
  • Add a missing transition to the FS-Cache object state machine to handle an
    invalidation event occurring between the back end completing the object lookup
    by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE)
    and the backend returning to fscache_lookup_object() and thence to
    fscache_object_state_machine() which then does a goto lookup_transit to handle
    the transition - but lookup_transit doesn't handle EV_INVALIDATE.
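
    As a rough illustration, a transition function with an explicit case for
    the invalidation event might look like this. The states and events below
    are a reduced, hypothetical model and do not reproduce the real FS-Cache
    object state machine.

```c
#include <assert.h>

enum model_state {
	MODEL_ST_LOOKING_UP,
	MODEL_ST_AVAILABLE,
	MODEL_ST_INVALIDATING,
	MODEL_ST_DYING,
};

enum model_event { MODEL_EV_NONE, MODEL_EV_INVALIDATE, MODEL_EV_RETIRE };

/*
 * Return the next state after a lookup has completed, or -1 for an
 * unsupported state/event combination (where a real state machine
 * might BUG).
 */
static int model_lookup_transit(enum model_state state, enum model_event ev)
{
	if (state != MODEL_ST_AVAILABLE)
		return -1;

	switch (ev) {
	case MODEL_EV_NONE:
		return MODEL_ST_AVAILABLE;
	case MODEL_EV_INVALIDATE:	/* the previously missing transition */
		return MODEL_ST_INVALIDATING;
	case MODEL_EV_RETIRE:
		return MODEL_ST_DYING;
	}
	return -1;
}
```

    Removing the MODEL_EV_INVALIDATE case makes the function fall through to
    the "unsupported event" result, which is the situation the BUG above
    reports.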

    Without this, the following BUG can be logged:

    FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/object.c:357!

    Where event 2 is EV_INVALIDATE.

    Signed-off-by: David Howells

    David Howells
     
  • nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably
    leading to the following bad-page-state:

    BUG: Bad page state in process python-bin pfn:17d39b
    page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null)
    index:38686 (Tainted: G B ---------------- )
    Pid: 31053, comm: python-bin Tainted: G B ----------------
    2.6.32-71.24.1.el6.x86_64 #1
    Call Trace:
    [] bad_page+0x107/0x160
    [] free_hot_cold_page+0x1c9/0x220
    [] __pagevec_free+0x59/0xb0
    [] ? flush_tlb_others_ipi+0x128/0x130
    [] release_pages+0x21c/0x250
    [] ? remove_migration_pte+0x28a/0x2b0
    [] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70
    [] ____pagevec_lru_add+0x167/0x180
    [] __lru_cache_add+0x58/0x70
    [] lru_cache_add_lru+0x21/0x40
    [] putback_lru_page+0x69/0x100
    [] migrate_pages+0x13d/0x5d0
    [] ? ____pagevec_lru_add+0x167/0x180
    [] ? compaction_alloc+0x0/0x370
    [] compact_zone+0x4cc/0x600
    [] ? get_page_from_freelist+0x15c/0x820
    [] ? check_preempt_wakeup+0x1c4/0x3c0
    [] compact_zone_order+0x7e/0xb0
    [] try_to_compact_pages+0x109/0x170
    [] __alloc_pages_nodemask+0x5ed/0x850
    [] ? thread_return+0x4e/0x778
    [] alloc_pages_vma+0x93/0x150
    [] do_huge_pmd_anonymous_page+0x135/0x340
    [] ? rwsem_down_read_failed+0x26/0x30
    [] handle_mm_fault+0x245/0x2b0
    [] do_page_fault+0x123/0x3a0
    [] page_fault+0x25/0x30

    nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait
    - even if __GFP_WAIT is set. The reason it doesn't wait is that
    fscache_maybe_release_page() might deadlock the allocator as the work threads
    writing to the cache may all end up sleeping on memory allocation.

    However, I wonder if that is actually a problem. There are a number of things
    I can do to deal with this:

    (1) Make nfs_migrate_page() wait.

    (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.

    (3) Set a timeout around the wait.

    (4) Make nfs_migrate_page() return an error if the page is still busy.

    For the moment, I'll select (2) and (4).
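
    Options (2) and (4) can be sketched together in userspace as follows. All
    names are hypothetical models (MODEL_GFP_WAIT stands in for __GFP_WAIT),
    and returning -EBUSY from the migrate path is an assumption made for the
    sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_GFP_WAIT 0x1u	/* stand-in for __GFP_WAIT */

struct model_page { int being_written; };

/* Stand-in for sleeping until the cache write finishes. */
static void model_wait_for_write(struct model_page *p)
{
	p->being_written = 0;
}

/*
 * Option (2): only wait if the caller's gfp flags allow it.
 * Returns 1 if the page can be released, 0 if it is still busy.
 */
static int model_maybe_release_page(struct model_page *p, unsigned int gfp)
{
	assert(p != NULL);
	if (p->being_written) {
		if (!(gfp & MODEL_GFP_WAIT))
			return 0;
		model_wait_for_write(p);
	}
	return 1;
}

/* Option (4): refuse to migrate a page the cache is still using. */
static int model_migrate_page(struct model_page *p, unsigned int gfp)
{
	if (!model_maybe_release_page(p, gfp))
		return -EBUSY;
	return 0;
}
```

    A caller that may not sleep gets -EBUSY for a page that is still being
    written, while a caller that passes the wait flag blocks until the page is
    free and then proceeds.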

    Signed-off-by: David Howells
    Acked-by: Jeff Layton

    David Howells
     
  • The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG
    if there's been an I/O error because it may see the parent cache object in an
    unexpected state. It should only BUG if there hasn't been an I/O error.

    In this case the problem was produced by remounting the cache partition to be
    R/O. The EROFS state was detected and the cache was aborted, but not
    everything handled the aborting correctly.

    SysRq : Emergency Remount R/O
    EXT4-fs (sda6): re-mounted. Opts: (null)
    Emergency Remount complete
    CacheFiles: I/O Error: Failed to update xattr with error -30
    FS-Cache: Cache cachefiles stopped due to I/O error
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:128!
    invalid opcode: 0000 [#1] SMP
    CPU 0
    Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

    Pid: 6612, comm: kworker/u:2 Not tainted 3.1.0-rc8-fsdevel+ #1093 /DG965RY
    RIP: 0010:[] [] fscache_submit_exclusive_op+0x2ad/0x2c2 [fscache]
    RSP: 0018:ffff880000853d40 EFLAGS: 00010206
    RAX: ffff880038ac72a8 RBX: ffff8800181f2260 RCX: ffffffff81f2b2b0
    RDX: 0000000000000001 RSI: ffffffff8179a478 RDI: ffff8800181f2280
    RBP: ffff880000853d60 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000001 R12: ffff880038ac7268
    R13: ffff8800181f2280 R14: ffff88003a359190 R15: 000000010122b162
    FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000034cc4a77f0 CR3: 0000000010e96000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/u:2 (pid: 6612, threadinfo ffff880000852000, task ffff880014c3c040)
    Stack:
    ffff8800181f2260 ffff8800181f2310 ffff880038ac7268 ffff8800181f2260
    ffff880000853dc0 ffffffffa0072375 ffff880037ecfe00 ffff88003a359198
    ffff880000853dc0 0000000000000246 0000000000000000 ffff88000a91d308
    Call Trace:
    [] fscache_object_work_func+0x792/0xe65 [fscache]
    [] process_one_work+0x1eb/0x37f
    [] ? process_one_work+0x18d/0x37f
    [] ? fscache_enqueue_dependents+0xd8/0xd8 [fscache]
    [] worker_thread+0x15a/0x21a
    [] ? rescuer_thread+0x188/0x188
    [] kthread+0x7f/0x87
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x45/0xc0
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x53/0x53
    [] ? gs_change+0xb/0xb

    Signed-off-by: David Howells

    David Howells
     
  • Limit the number of I/O error reports for a cache to 1 to prevent massive
    amounts of noise. After the first I/O error the cache is taken off line
    automatically, so must be restarted to resume caching.
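
    One way to get report-once behaviour is a per-cache flag that is set when
    the first error is logged. The sketch below is a hypothetical userspace
    model (struct model_cache and model_io_error() are invented names), not
    the FS-Cache code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct model_cache {
	int io_error_reported;	/* set once the first error has been logged */
	int online;
};

/*
 * Take the cache offline and report the I/O error at most once.
 * Returns 1 if a message was printed, 0 if it was suppressed.
 */
static int model_io_error(struct model_cache *c)
{
	assert(c != NULL);
	c->online = 0;
	if (c->io_error_reported)
		return 0;
	c->io_error_reported = 1;
	fprintf(stderr, "model cache: stopped due to I/O error\n");
	return 1;
}
```

    Repeated errors after the first still leave the cache offline but produce
    no further output.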

    Signed-off-by: David Howells

    David Howells
     
  • Don't mask off the object event mask when printing it. That way it can be seen
    if there are bits set that shouldn't be.

    Signed-off-by: David Howells

    David Howells
     
  • Initialise the object event mask with the calculated mask rather than unmasking
    undefined events also.

    Signed-off-by: David Howells

    David Howells
     
  • CacheFiles is missing some calls to fscache_retrieval_complete() in the error
    handling/collision paths of its reader functions.

    This can be seen by the following assertion tripping in fscache_put_operation()
    whereby the operation being destroyed is still in the in-progress state and has
    not been cancelled or completed:

    FS-Cache: Assertion failed
    3 == 5 is false
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/operation.c:408!
    invalid opcode: 0000 [#1] SMP
    CPU 2
    Modules linked in: xfs ioatdma dca loop joydev evdev
    psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
    pci_hotplug sg sr_mod]

    Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell Inc. PowerEdge 1950/0DT097
    RIP: 0010:[] [] fscache_put_operation+0x304/0x330
    RSP: 0018:ffff880062f739d8 EFLAGS: 00010296
    RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
    RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
    RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
    R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
    FS: 00007f63dcf636e0(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process httpd (pid: 8062, threadinfo ffff880062f72000, task ffff880087e58000)
    Stack:
    ffff880062f73bf8 0000000000000000 ffff880062f73bf8 ffff880037a2cd20
    ffff880062f73a68 ffffffff8119aa7e ffff88006540e000 ffff880062f73ad4
    ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48 ffff8800c5122e40
    Call Trace:
    [] __fscache_read_or_alloc_pages+0x1fe/0x530
    [] __nfs_readpages_from_fscache+0x70/0x1c0
    [] nfs_readpages+0xca/0x1e0
    [] ? rpc_do_put_task+0x36/0x50
    [] ? alloc_nfs_open_context+0x4b/0x110
    [] ? rpc_call_sync+0x5a/0x70
    [] __do_page_cache_readahead+0x1ca/0x270
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x11d/0x250
    [] page_cache_sync_readahead+0x36/0x60
    [] generic_file_aio_read+0x454/0x770
    [] nfs_file_read+0xe1/0x130
    [] do_sync_read+0xd9/0x120
    [] ? mntput+0x1f/0x40
    [] ? fput+0x1cb/0x260
    [] vfs_read+0xc8/0x180
    [] sys_read+0x55/0x90

    Reported-by: Mark Moseley
    Signed-off-by: David Howells

    David Howells
     
  • Provide a proper invalidation method rather than relying on the netfs retiring
    the cookie it has and getting a new one. The problem with this is that it isn't
    easy for the netfs to make sure that it has completed/cancelled all its
    outstanding storage and retrieval operations on the cookie it is retiring.

    Instead, have the cache provide an invalidation method that will cancel or wait
    for all currently outstanding operations before invalidating the cache, and
    will cause new operations to queue up behind that. Whilst invalidation is in
    progress, some requests will be rejected until the cache can stack a barrier on
    the operation queue to cause new operations to be deferred behind it.

    Signed-off-by: David Howells

    David Howells
     
  • Fix the state management of internal fscache operations and the accounting of
    what operations are in what states.

    This is done by:

    (1) Give struct fscache_operation an enum variable that directly represents the
    state it's currently in, rather than spreading this knowledge over a bunch
    of flags, who's processing the operation at the moment and whether it is
    queued or not.

    This makes it easier to write assertions to check the state at various
    points and to prevent invalid state transitions.

    (2) Add an 'operation complete' state and supply a function to indicate the
    completion of an operation (fscache_op_complete()) and make things call
    it. The final call to fscache_put_operation() can then check that an op
    is in the appropriate state (complete or cancelled).

    (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
    govern the state of an object:

    (a) The ->n_ops is now the number of extant operations on the object
    and is now decremented by fscache_put_operation() only.

    (b) The ->n_in_progress is simply the number of operations that have been
    taken off of the object's pending queue for the purposes of being
    run. This is decremented by fscache_op_complete() only.

    (c) The ->n_exclusive is the number of exclusive ops that have been
    submitted and queued or are in progress. It is decremented by
    fscache_op_complete() and by fscache_cancel_op().

    fscache_put_operation() and fscache_operation_gc() now no longer try to
    clean up ->n_exclusive and ->n_in_progress. That was leading to double
    decrements against fscache_cancel_op().

    fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
    double decrements against fscache_put_operation().

    fscache_submit_exclusive_op() now decides whether it has to queue an op
    based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
    will persist in being true even after all preceding operations have been
    cancelled or completed. Furthermore, if an object is active and there are
    runnable ops against it, there must be at least one op running.

    (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
    provide a function to record completion of the pages as they complete.

    When n_pages reaches 0, the operation is deemed to be complete and
    fscache_op_complete() is called.

    Add calls to fscache_retrieval_complete() anywhere we've finished with a
    page we've been given to read or allocate for. This includes places where
    we just return pages to the netfs for reading from the server and where
    accessing the cache fails and we discard the proposed netfs page.
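
    The scheme in points (1) to (4) can be modelled in a few lines of userspace
    C. This is a sketch under stated assumptions: the struct layouts and the
    model_*() helpers are hypothetical and only mirror the counters and states
    named above, with ->n_exclusive and cancellation left out for brevity.

```c
#include <assert.h>

enum model_op_state {
	MODEL_OP_PENDING,
	MODEL_OP_IN_PROGRESS,
	MODEL_OP_COMPLETE,
	MODEL_OP_CANCELLED,
};

struct model_object { int n_ops, n_in_progress; };

struct model_retrieval {
	enum model_op_state state;
	struct model_object *object;
	int n_pages;	/* pages not yet accounted for */
};

static void model_submit(struct model_retrieval *op,
			 struct model_object *obj, int n_pages)
{
	op->object = obj;
	op->n_pages = n_pages;
	op->state = MODEL_OP_PENDING;
	obj->n_ops++;
}

/* Take the op off the pending queue and start running it. */
static void model_run(struct model_retrieval *op)
{
	assert(op->state == MODEL_OP_PENDING);
	op->state = MODEL_OP_IN_PROGRESS;
	op->object->n_in_progress++;
}

/* Only completion decrements n_in_progress. */
static void model_op_complete(struct model_retrieval *op)
{
	assert(op->state == MODEL_OP_IN_PROGRESS);
	op->state = MODEL_OP_COMPLETE;
	op->object->n_in_progress--;
}

/* Record that n pages are finished; complete the op when none remain. */
static void model_retrieval_complete(struct model_retrieval *op, int n)
{
	assert(n <= op->n_pages);
	op->n_pages -= n;
	if (op->n_pages == 0)
		model_op_complete(op);
}

/* Final put: the op must be complete or cancelled; only this drops n_ops. */
static void model_put_operation(struct model_retrieval *op)
{
	assert(op->state == MODEL_OP_COMPLETE ||
	       op->state == MODEL_OP_CANCELLED);
	op->object->n_ops--;
}
```

    Because each counter is decremented in exactly one place, the double
    decrements described above cannot occur in this model, and a put on an op
    that is still in progress trips the assertion.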

    The bugs in the unfixed state management manifest themselves as oopses like the
    following where the operation completion gets out of sync with return of the
    cookie by the netfs. This is possible because the cache unlocks and returns
    all the netfs pages before recording its completion - which means that there's
    nothing to stop the netfs discarding them and returning the cookie.

    FS-Cache: Cookie 'NFS.fh' still has outstanding reads
    ------------[ cut here ]------------
    kernel BUG at fs/fscache/cookie.c:519!
    invalid opcode: 0000 [#1] SMP
    CPU 1
    Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

    Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
    RIP: 0010:[] [] __fscache_relinquish_cookie+0x170/0x343 [fscache]
    RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
    RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
    RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
    RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
    R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
    R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
    FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
    Stack:
    ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
    ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
    ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
    Call Trace:
    [] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
    [] nfs_clear_inode+0x3c/0x41 [nfs]
    [] nfs4_evict_inode+0x2f/0x33 [nfs]
    [] evict+0xa1/0x15c
    [] dispose_list+0x2c/0x38
    [] prune_icache_sb+0x28c/0x29b
    [] prune_super+0xd5/0x140
    [] shrink_slab+0x102/0x1ab
    [] balance_pgdat+0x2f2/0x595
    [] ? process_timeout+0xb/0xb
    [] kswapd+0x270/0x289
    [] ? __init_waitqueue_head+0x46/0x46
    [] ? balance_pgdat+0x595/0x595
    [] kthread+0x7f/0x87
    [] kernel_thread_helper+0x4/0x10
    [] ? finish_task_switch+0x45/0xc0
    [] ? retint_restore_args+0xe/0xe
    [] ? __init_kthread_worker+0x53/0x53
    [] ? gs_change+0xb/0xb

    Signed-off-by: David Howells

    David Howells
     
  • Make fscache_relinquish_cookie() log a warning and wait if there are any
    outstanding reads left on the cookie it was given.

    Signed-off-by: David Howells

    David Howells
     
  • Check that the netfs isn't trying to relinquish a cookie that still has read
    operations in progress upon it. If there are, then log a warning and BUG.

    Signed-off-by: David Howells

    David Howells
     
  • Downgrade the requirements passed to the allocator in the gfp flags parameter.
    FS-Cache/CacheFiles can handle OOM conditions simply by aborting the attempt to
    store an object or a page in the cache.

    Signed-off-by: David Howells

    David Howells
     
  • Under some circumstances CacheFiles defers the marking of pages with PG_fscache
    so that it can take advantage of pagevecs to reduce the number of calls to
    fscache_mark_pages_cached() and the netfs's hook to keep track of this.

    There are, however, two problems with this:

    (1) It can lead to the PG_fscache mark being applied _after_ the page is set
    PG_uptodate and unlocked (by the call to fscache_end_io()).

    (2) CacheFiles's ref on the page is dropped immediately following
    fscache_end_io() - and so may not still be held when the mark is applied.
    This can lead to the page being passed back to the allocator before the
    mark is applied.

    Fix this by, where appropriate, marking the page before calling
    fscache_end_io() and releasing the page. This means that we can't take
    advantage of pagevecs and have to make a separate call for each page to the
    marking routines.
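
    The ordering change can be illustrated with a small model. struct
    model_page and the model_*() helpers are hypothetical; the only point is
    that the mark is applied while the reference is still held and before the
    page is made up to date and unlocked.

```c
#include <assert.h>

struct model_page {
	int marked_cached;	/* models PG_fscache */
	int refcount;		/* models the cache's ref on the page */
	int uptodate;		/* models PG_uptodate plus unlock */
};

/* Marking is only safe while we still hold a reference. */
static void model_mark_cached(struct model_page *p)
{
	assert(p->refcount > 0);
	p->marked_cached = 1;
}

static void model_end_io(struct model_page *p)
{
	p->uptodate = 1;
}

static void model_put_page(struct model_page *p)
{
	p->refcount--;
}

/* Per-page sequence: mark first, then end the I/O, then drop the ref. */
static void model_finish_read(struct model_page *p)
{
	model_mark_cached(p);
	model_end_io(p);
	model_put_page(p);
}
```

    Moving model_mark_cached() after model_put_page() would trip the assertion
    once the last reference is gone, which corresponds to the page being
    returned to the allocator before the mark is applied.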

    The symptoms of this are Bad Page state errors cropping up under memory
    pressure, for example:

    BUG: Bad page state in process tar pfn:002da
    page:ffffea0000009fb0 count:0 mapcount:0 mapping: (null) index:0x1447
    page flags: 0x1000(private_2)
    Pid: 4574, comm: tar Tainted: G W 3.1.0-rc4-fsdevel+ #1064
    Call Trace:
    [] ? dump_page+0xb9/0xbe
    [] bad_page+0xd5/0xea
    [] get_page_from_freelist+0x35b/0x46a
    [] __alloc_pages_nodemask+0x362/0x662
    [] __do_page_cache_readahead+0x13a/0x267
    [] ? __do_page_cache_readahead+0xa2/0x267
    [] ra_submit+0x1c/0x20
    [] ondemand_readahead+0x28b/0x29a
    [] ? ondemand_readahead+0x163/0x29a
    [] page_cache_sync_readahead+0x38/0x3a
    [] generic_file_aio_read+0x2ab/0x67e
    [] nfs_file_read+0xa4/0xc9 [nfs]
    [] do_sync_read+0xba/0xfa
    [] ? security_file_permission+0x7b/0x84
    [] ? rw_verify_area+0xab/0xc8
    [] vfs_read+0xaa/0x13a
    [] sys_read+0x45/0x6c
    [] system_call_fastpath+0x16/0x1b

    As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.

    Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
    set appropriately showed that sometimes it wasn't. This led to the discovery
    that sometimes the page has apparently been reclaimed by the time the marker
    got to see it.

    Reported-by: M. Stevens
    Signed-off-by: David Howells
    Reviewed-by: Jeff Layton

    David Howells
     

22 Jul, 2011

1 commit


08 Jul, 2011

1 commit

  • Add an FS-Cache helper to bulk uncache pages on an inode. This will
    only work for the circumstance where the pages in the cache correspond
    1:1 with the pages attached to an inode's page cache.

    This is required for CIFS and NFS: When disabling an inode cookie, we were
    returning the cookie and setting cifsi->fscache to NULL but failed to
    invalidate any previously mapped pages. This resulted in "Bad page
    state" errors and manifested in other kinds of errors when running
    fsstress. Fix it by uncaching mapped pages when we disable the inode
    cookie.

    This patch should fix the following oops and "Bad page state" errors
    seen during fsstress testing.

    ------------[ cut here ]------------
    kernel BUG at fs/cachefiles/namei.c:201!
    invalid opcode: 0000 [#1] SMP
    Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
    RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
    RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
    RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
    RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
    R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
    R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
    FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
    Stack:
    0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
    ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
    ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
    Call Trace:
    cachefiles_lookup_object+0x78/0xd4 [cachefiles]
    fscache_lookup_object+0x131/0x16d [fscache]
    fscache_object_work_func+0x1bc/0x669 [fscache]
    process_one_work+0x186/0x298
    worker_thread+0xda/0x15d
    kthread+0x84/0x8c
    kernel_thread_helper+0x4/0x10
    RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
    ---[ end trace 1d481c9af1804caa ]---

    I tested the uncaching by the following means:

    (1) Create a big file on my NFS server (104857600 bytes).

    (2) Read the file into the cache with md5sum on the NFS client. Look in
    /proc/fs/fscache/stats:

    Pages : mrk=25601 unc=0

    (3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc
    again:

    Pages : mrk=25601 unc=25601
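
    The check in steps (2) and (3) could be automated along these lines. This
    is a minimal sketch that assumes the "Pages : mrk=N unc=M" line format
    shown above; struct page_counts, parse_pages_line() and
    all_marked_pages_uncached() are hypothetical helper names and not part of
    FS-Cache.

```c
#include <assert.h>
#include <stdio.h>

/* Counters taken from the "Pages" line of /proc/fs/fscache/stats. */
struct page_counts {
	unsigned long marked;   /* the mrk= value */
	unsigned long uncached; /* the unc= value */
};

/*
 * Parse one line of the form "Pages : mrk=N unc=M".
 * Returns 0 on success, -1 if the line does not match that form.
 */
int parse_pages_line(const char *line, struct page_counts *out)
{
	assert(line != NULL && out != NULL);

	/* Whitespace in the format matches any run of blanks in the input. */
	if (sscanf(line, " Pages : mrk=%lu unc=%lu",
		   &out->marked, &out->uncached) != 2)
		return -1;
	return 0;
}

/* True once every page that was marked has also been uncached. */
int all_marked_pages_uncached(const struct page_counts *c)
{
	return c->marked == c->uncached;
}
```

    Fed the line from step (2) this reports that pages are still marked; fed
    the line from step (3) it reports that all marked pages were uncached.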

    Reported-by: Jeff Layton
    Signed-off-by: David Howells
    Reviewed-and-Tested-by: Suresh Jayaraman
    cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    David Howells
     

25 May, 2011

1 commit