25 May, 2011

1 commit


31 Mar, 2011

1 commit


23 Jul, 2010

3 commits

  • fscache no longer uses slow-work. Drop references to it.

    Signed-off-by: Tejun Heo
    Acked-by: David Howells

    Tejun Heo
     
  • Make fscache operations use only a workqueue instead of a combination
    of workqueue and slow-work. FSCACHE_OP_SLOW is dropped;
    FSCACHE_OP_FAST is renamed to FSCACHE_OP_ASYNC and uses the newly added
    fscache_op_wq workqueue to execute op->processor().
    fscache_operation_init_slow() is dropped and fscache_operation_init()
    now takes the @processor argument directly.

    * Unbound workqueue is used.

    * fscache_retrieval_work() is no longer necessary as OP_ASYNC now does
    the equivalent thing.

    * sysctl fscache.operation_max_active added to control concurrency.
    The default value is nr_cpus clamped between 2 and
    WQ_UNBOUND_MAX_ACTIVE.

    * debugfs support is dropped for now. Tracing API based debug
    facility is planned to be added.
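
    The default for fscache.operation_max_active described above can be
    sketched as a simple clamp. This is a minimal userspace illustration, not
    the kernel code: default_max_active() and SKETCH_WQ_UNBOUND_MAX_ACTIVE are
    hypothetical names, and the value 512 is only an assumed stand-in for the
    kernel's WQ_UNBOUND_MAX_ACTIVE.

```c
#include <assert.h>

/* Assumed stand-in; the kernel's real WQ_UNBOUND_MAX_ACTIVE may differ. */
#define SKETCH_WQ_UNBOUND_MAX_ACTIVE 512u

/*
 * Clamp the CPU count between a lower bound and the workqueue limit.
 * The lower bound is passed in so the same helper can model different
 * sysctl defaults (2 for operation_max_active in the text above).
 */
static unsigned int default_max_active(unsigned int nr_cpus,
				       unsigned int lower_bound)
{
	if (nr_cpus < lower_bound)
		return lower_bound;
	if (nr_cpus > SKETCH_WQ_UNBOUND_MAX_ACTIVE)
		return SKETCH_WQ_UNBOUND_MAX_ACTIVE;
	return nr_cpus;
}
```

    With a lower bound of 2, a single-CPU machine would still allow two
    concurrent operations, while a very large machine is capped at the
    workqueue limit.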

    Signed-off-by: Tejun Heo
    Acked-by: David Howells

    Tejun Heo
     
  • Make fscache object state transition callbacks use a workqueue instead
    of slow-work. A new dedicated unbound workqueue, fscache_object_wq,
    is created. The get/put callbacks are renamed and modified to take
    @object, and are called directly from the enqueue wrapper and the work
    function. While at it, make all open-coded instances of get/put
    use fscache_get/put_object().

    * Unbound workqueue is used.

    * work_busy() output is printed instead of slow-work flags in object
    debugging outputs. They mean basically the same thing bit-for-bit.

    * sysctl fscache.object_max_active added to control concurrency. The
    default value is nr_cpus clamped between 4 and
    WQ_UNBOUND_MAX_ACTIVE.

    * slow_work_sleep_till_thread_needed() is replaced with an
    fscache-private implementation, fscache_object_sleep_till_congested(),
    which waits on fscache_object_wq congestion.

    * debugfs support is dropped for now. Tracing API based debug
    facility is planned to be added.

    Signed-off-by: Tejun Heo
    Acked-by: David Howells

    Tejun Heo
     

30 Mar, 2010

1 commit


20 Nov, 2009

7 commits

  • Catch an overly long wait for an old, dying active object when we want to
    replace it with a new one. The probability is that all the slow-work threads
    are hogged, and the delete can't get a look in.

    What we do instead is:

    (1) if there's nothing in the slow work queue, we sleep until either the dying
    object has finished dying or there is something in the slow work queue
    behind which we can queue our object.

    (2) if there is something in the slow work queue, we return ETIMEDOUT to
    fscache_lookup_object(), which then puts us back on the slow work queue,
    presumably behind the deletion that we're blocked by. We are then
    deferred for a while until we work our way back through the queue -
    without blocking a slow-work thread unnecessarily.
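
    The two cases above amount to a small decision rule. The following is a
    hedged sketch only: wait_for_old_object() and KEEP_SLEEPING are
    hypothetical names, and the real code waits on a bit rather than taking
    booleans. It also assumes that a finished death takes priority over a
    busy queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical "go back to sleep" result, distinct from 0 and -errno. */
#define KEEP_SLEEPING 1

/*
 * Decide what a lookup does while the old object is still dying:
 *   0           the old object is gone, carry on with the lookup
 *   -ETIMEDOUT  other slow work is queued, so requeue behind it
 *   KEEP_SLEEPING  nothing else is queued, so keep waiting
 */
static int wait_for_old_object(bool old_object_gone, bool slow_work_queued)
{
	if (old_object_gone)
		return 0;
	if (slow_work_queued)
		return -ETIMEDOUT;
	return KEEP_SLEEPING;
}
```

    Returning -ETIMEDOUT rather than sleeping is what frees the slow-work
    thread to run the deletion that the lookup is waiting for.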

    A backtrace similar to the following may appear in the log without this patch:

    INFO: task kslowd004:5711 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kslowd004 D 0000000000000000 0 5711 2 0x00000080
    ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
    ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
    000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
    Call Trace:
    ? trace_hardirqs_on+0xd/0xf
    ? cachefiles_wait_bit+0x0/0xd [cachefiles]
    cachefiles_wait_bit+0x9/0xd [cachefiles]
    __wait_on_bit+0x43/0x76
    ? ext3_xattr_get+0x1ec/0x270
    out_of_line_wait_on_bit+0x69/0x74
    ? cachefiles_wait_bit+0x0/0xd [cachefiles]
    ? wake_bit_function+0x0/0x2e
    cachefiles_mark_object_active+0x203/0x23b [cachefiles]
    cachefiles_walk_to_object+0x558/0x827 [cachefiles]
    cachefiles_lookup_object+0xac/0x12a [cachefiles]
    fscache_lookup_object+0x1c7/0x214 [fscache]
    fscache_object_state_machine+0xa5/0x52d [fscache]
    fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
    slow_work_execute+0x18f/0x2d1
    slow_work_thread+0x1c5/0x308
    ? autoremove_wake_function+0x0/0x34
    ? slow_work_thread+0x0/0x308
    kthread+0x7a/0x82
    child_rip+0xa/0x20
    ? restore_args+0x0/0x30
    ? kthread+0x0/0x82
    ? child_rip+0x0/0x20
    1 lock held by kslowd004/5711:
    #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]

    Signed-off-by: David Howells

    David Howells
     
  • cachefiles_write_page() writes a full page to the backing file for the last
    page of the netfs file, even if the netfs file's last page is only a partial
    page.

    This causes the EOF on the backing file to be extended beyond the EOF of the
    netfs, and thus the backing file will be truncated by cachefiles_attr_changed()
    called from cachefiles_lookup_object().

    So we need to limit the write we make to the backing file on that last page
    such that it doesn't push the EOF too far.
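
    The limit on the final write can be illustrated with a little arithmetic.
    This is a sketch under stated assumptions rather than the cachefiles
    code: last_page_write_len() is a hypothetical helper, and the page size
    is passed in explicitly instead of using the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes to write for the page at page_index so that the
 * backing file's EOF never passes the netfs file's EOF (netfs_size).
 */
static size_t last_page_write_len(uint64_t page_index, uint64_t netfs_size,
				  size_t page_size)
{
	uint64_t pos = page_index * page_size;
	uint64_t remaining;

	if (pos >= netfs_size)
		return 0;		/* page lies wholly beyond EOF */

	remaining = netfs_size - pos;
	return remaining < page_size ? (size_t)remaining : page_size;
}
```

    For a 8193-byte netfs file with 4096-byte pages, pages 0 and 1 are written
    in full and page 2 contributes a single byte.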

    Also, if a backing file that has a partial page at the end is expanded, we
    discard the partial page and refetch it on the basis that we then have a hole
    in the file with invalid data, and should the power go out... A better way to
    deal with this could be to record a note that the partial page contains invalid
    data until the correct data is written into it.

    This isn't a problem for netfs's that discard the whole backing file if the
    file size changes (such as NFS).

    Signed-off-by: David Howells

    David Howells
     
  • Start processing an object's operations when that object moves into the DYING
    state as the object cannot be destroyed until all its outstanding operations
    have completed.

    Furthermore, make sure that read and allocation operations handle being woken
    up on a dead object. Such events are recorded in the Allocs.abt and
    Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.

    The code for waiting for object activation for the read and allocation
    operations is also extracted into its own function as it is much the same in
    all cases, differing only in the stats incremented.

    Signed-off-by: David Howells

    David Howells
     
  • Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
    under OOM conditions, but that are waiting for write to the cache. Under these
    conditions, vmscan calls the releasepage() function of the netfs, asking if a
    page can be discarded.

    The problem is typified by the following trace of a stuck process:

    kslowd005 D 0000000000000000 0 4253 2 0x00000080
    ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
    0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
    000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
    Call Trace:
    __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
    ? autoremove_wake_function+0x0/0x34
    ? __fscache_check_page_write+0x63/0x70 [fscache]
    nfs_fscache_release_page+0x4e/0xc4 [nfs]
    nfs_release_page+0x3c/0x41 [nfs]
    try_to_release_page+0x32/0x3b
    shrink_page_list+0x316/0x4ac
    shrink_inactive_list+0x392/0x67c
    ? __mutex_unlock_slowpath+0x100/0x10b
    ? trace_hardirqs_on_caller+0x10c/0x130
    ? mutex_unlock+0x9/0xb
    shrink_list+0x8d/0x8f
    shrink_zone+0x278/0x33c
    ? ktime_get_ts+0xad/0xba
    try_to_free_pages+0x22e/0x392
    ? isolate_pages_global+0x0/0x212
    __alloc_pages_nodemask+0x3dc/0x5cf
    grab_cache_page_write_begin+0x65/0xaa
    ext3_write_begin+0x78/0x1eb
    generic_file_buffered_write+0x109/0x28c
    ? current_fs_time+0x22/0x29
    __generic_file_aio_write+0x350/0x385
    ? generic_file_aio_write+0x4a/0xae
    generic_file_aio_write+0x60/0xae
    do_sync_write+0xe3/0x120
    ? autoremove_wake_function+0x0/0x34
    ? __dentry_open+0x1a5/0x2b8
    ? dentry_open+0x82/0x89
    cachefiles_write_page+0x298/0x335 [cachefiles]
    fscache_write_op+0x178/0x2c2 [fscache]
    fscache_op_execute+0x7a/0xd1 [fscache]
    slow_work_execute+0x18f/0x2d1
    slow_work_thread+0x1c5/0x308
    ? autoremove_wake_function+0x0/0x34
    ? slow_work_thread+0x0/0x308
    kthread+0x7a/0x82
    child_rip+0xa/0x20
    ? restore_args+0x0/0x30
    ? tg_shares_up+0x171/0x227
    ? kthread+0x0/0x82
    ? child_rip+0x0/0x20

    In the above backtrace, the following is happening:

    (1) A page storage operation is being executed by a slow-work thread
    (fscache_write_op()).

    (2) FS-Cache farms the operation out to the cache to perform
    (cachefiles_write_page()).

    (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
    standard write (do_sync_write()) under KERNEL_DS directly from the netfs
    page.

    (4) However, for Ext3 to perform the write, it must allocate some memory, in
    particular, it must allocate at least one page cache page into which it
    can copy the data from the netfs page.

    (5) Under OOM conditions, the memory allocator can't immediately come up with
    a page, so it uses vmscan to find something to discard
    (try_to_free_pages()).

    (6) vmscan finds a clean netfs page it might be able to discard (possibly the
    one it's trying to write out).

    (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
    called with __GFP_WAIT, so the netfs decides to wait for the store to
    complete (__fscache_wait_on_page_write()).

    (8) This blocks a slow-work processing thread - possibly against itself.

    The system ends up stuck because it can't write out any netfs pages to the
    cache without allocating more memory.

    To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
    actually being performed. This means that some data won't make it into the
    cache this time. To support this, a new FS-Cache function,
    fscache_maybe_release_page(), is added that replaces what the netfs
    releasepage() functions used to do with respect to the cache.

    The decisions fscache_maybe_release_page() makes are counted and displayed
    through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
    counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
    pages that were pending storage when we first looked, but weren't by the time
    we got the object lock; "bsy=N" - pages that we ignored as they were actively
    being written when we looked; and "can=N" - pages that we cancelled the storage
    of.
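
    The four counters map onto four outcomes of one decision. A minimal
    model of that decision is sketched below; it is not the FS-Cache
    implementation. The struct, enum and function names are hypothetical, the
    two state arguments stand in for looking at the page before and after
    taking the object lock, and the return value (whether the page may be
    released) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Counters as labelled on the "VmScan" line of the stats file. */
struct vmscan_stats {
	unsigned long nos, gon, bsy, can;
};

enum page_store_state {
	STORE_NONE,	/* not pending storage */
	STORE_PENDING,	/* queued for storage, write not started */
	STORE_WRITING,	/* actively being written to the cache */
};

/* Returns true if the page may be released to vmscan (assumed semantics). */
static bool maybe_release_page(struct vmscan_stats *st,
			       enum page_store_state before_lock,
			       enum page_store_state after_lock)
{
	if (before_lock == STORE_NONE) {
		st->nos++;		/* was never pending storage */
		return true;
	}
	if (after_lock == STORE_NONE) {
		st->gon++;		/* store finished while we got the lock */
		return true;
	}
	if (after_lock == STORE_WRITING) {
		st->bsy++;		/* being written right now: leave it */
		return false;
	}
	st->can++;			/* cancel the pending store */
	return true;
}
```

    Only the "bsy" case keeps the page pinned; every other outcome lets the
    memory allocator make progress.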

    What I'd really like to do is alter the behaviour of the cancellation
    heuristics, depending on how necessary it is to expel pages. If there are
    plenty of other pages that aren't waiting to be written to the cache that
    could be ejected first, then it would be nice to hold up on immediate
    cancellation of cache writes - but I don't see a way of doing that.

    Signed-off-by: David Howells

    David Howells
     
  • FS-Cache has two structs internally for keeping track of the internal state of
    a cached file: the fscache_cookie struct, which represents the netfs's state,
    and fscache_object struct, which represents the cache's state. Each has a
    pointer that points to the other (when both are in existence), and each has a
    spinlock for pointer maintenance.

    Since netfs operations approach these structures from the cookie side, they get
    the cookie lock first, then the object lock. Cache operations, on the other
    hand, approach from the object side, and get the object lock first. It is not
    then permitted for a cache operation to get the cookie lock whilst it is
    holding the object lock lest deadlock occur; instead, it must do one of two
    things:

    (1) increment the cookie usage counter, drop the object lock and then get both
    locks in order, or

    (2) simply hold the object lock as certain parts of the cookie may not be
    altered whilst the object lock is held.

    It is also not permitted to follow either pointer without holding the lock at
    the end you start with. To break the pointers between the cookie and the
    object, both locks must be held.

    fscache_write_op(), however, violates the locking rules: It attempts to get the
    cookie lock without (a) checking that the cookie pointer is a valid pointer,
    and (b) holding the object lock to protect the cookie pointer whilst it follows
    it. This is so that it can access the pending page store tree without
    interference from __fscache_write_page().

    This is fixed by splitting the cookie lock, such that the page store tracking
    tree is protected by its own lock, and checking that the cookie pointer is
    non-NULL before we attempt to follow it whilst holding the object lock.

    The new lock is subordinate to both the cookie lock and the object lock, and so
    should be taken after those.
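
    The resulting lock hierarchy (cookie lock, then object lock, then the new
    page store lock) can be modelled with a toy rank checker. This is purely
    illustrative: the kernel uses spinlocks and lockdep, and every name below
    is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Lower rank must be taken first; the new store lock is innermost. */
enum lock_rank {
	LOCK_COOKIE = 1,
	LOCK_OBJECT = 2,
	LOCK_STORES = 3,
};

struct held_locks {
	unsigned int mask;	/* bit N set means the lock of rank N is held */
};

/* Refuse to take a lock if one of equal or higher rank is already held. */
static bool try_acquire(struct held_locks *held, enum lock_rank rank)
{
	unsigned int same_or_higher = ~0u << rank;

	if (held->mask & same_or_higher)
		return false;	/* would invert the documented order */
	held->mask |= 1u << rank;
	return true;
}

static void release(struct held_locks *held, enum lock_rank rank)
{
	held->mask &= ~(1u << rank);
}
```

    In this model a cache operation holding the object lock is refused the
    cookie lock, which is the inversion the rules above forbid, but it may
    still take the store lock.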

    Signed-off-by: David Howells

    David Howells
     
  • Allow the current state of all fscache objects to be dumped by doing:

    cat /proc/fs/fscache/objects

    By default, all objects and all fields will be shown. This can be restricted
    by adding a suitable key to one of the caller's keyrings (such as the session
    keyring):

    keyctl add user fscache:objlist "" @s

    The restrictions are:

    K Show hexdump of object key (don't show if not given)
    A Show hexdump of object aux data (don't show if not given)

    And paired restrictions:

    C Show objects that have a cookie
    c Show objects that don't have a cookie
    B Show objects that are busy
    b Show objects that aren't busy
    W Show objects that have pending writes
    w Show objects that don't have pending writes
    R Show objects that have outstanding reads
    r Show objects that don't have outstanding reads
    S Show objects that have slow work queued
    s Show objects that don't have slow work queued

    If neither side of a restriction pair is given, then both are implied. For
    example:

    keyctl add user fscache:objlist KB @s

    shows objects that are busy, and lists their object keys, but does not dump
    their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
    not implied.
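
    The "neither side given implies both" rule can be sketched as a small
    parser over the restriction string. This is an illustration, not the
    kernel's parser: flag_letters, flag_bit() and parse_restrictions() are
    hypothetical names, and silently ignoring unknown letters is an
    assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* K and A are standalone; the rest are upper/lower restriction pairs. */
static const char flag_letters[] = "KACcBbWwRrSs";

/* Bit for one flag letter, or 0 if the letter is not recognised. */
static unsigned int flag_bit(char letter)
{
	const char *p = letter ? strchr(flag_letters, letter) : NULL;

	return p ? 1u << (p - flag_letters) : 0;
}

static unsigned int parse_restrictions(const char *spec)
{
	unsigned int mask = 0;
	size_t i;

	for (; *spec; spec++)
		mask |= flag_bit(*spec);

	/* Pairs start at index 2; an untouched pair implies both sides. */
	for (i = 2; flag_letters[i]; i += 2) {
		unsigned int pair = (1u << i) | (1u << (i + 1));

		if (!(mask & pair))
			mask |= pair;
	}
	return mask;
}
```

    For "KB" this sets K and B, fills in both sides of the C, W, R and S
    pairs, and leaves A and b clear, matching the example above.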

    Signed-off-by: David Howells

    David Howells
     
  • Annotate slow-work runqueue proc lines for FS-Cache work items. Objects
    include the object ID and the state. Operations include the object ID, the
    operation ID and the operation type and state.

    Signed-off-by: David Howells

    David Howells
     

03 Apr, 2009

2 commits

  • Make FS-Cache create its /proc interface and present various statistical
    information through it. Also provide the functions for updating this
    information.

    These features are enabled by:

    CONFIG_FSCACHE_PROC
    CONFIG_FSCACHE_STATS
    CONFIG_FSCACHE_HISTOGRAM

    The /proc directory for FS-Cache is also exported so that caching modules can
    add their own statistics there too.

    The FS-Cache module is loadable at this point, and the statistics files can be
    examined by userspace:

    cat /proc/fs/fscache/stats
    cat /proc/fs/fscache/histogram

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells
     
  • Add the API for a generic facility (FS-Cache) by which caches may declare
    themselves open for business, and may obtain work to be done from network
    filesystems. The header file is included by:

    #include

    Documentation for the API is also added to:

    Documentation/filesystems/caching/backend-api.txt

    This API is not usable without the implementation of the utility functions
    which will be added in further patches.

    Signed-off-by: David Howells
    Acked-by: Steve Dickson
    Acked-by: Trond Myklebust
    Acked-by: Al Viro
    Tested-by: Daire Byrne

    David Howells