04 Feb, 2017

1 commit

  • commit 2aa6ba7b5ad3189cc27f14540aa2f57f0ed8df4b upstream.

    If we try to allocate memory pages to back an xfs_buf that we're trying
    to read, it's possible that we'll be so short on memory that the page
    allocation fails. For a blocking read we'll just wait, but for
    readahead we simply dump all the pages we've collected so far.

    Unfortunately, after dumping the pages we neglect to clear the
    _XBF_PAGES state, which means that the subsequent call to xfs_buf_free
    thinks that b_pages still points to pages we own. It then double-frees
    the b_pages pages.

    This results in screaming about negative page refcounts from the memory
    manager, which xfs oughtn't be triggering. To reproduce this case,
    mount a filesystem where the size of the inodes far outweighs the
available memory (a ~500M inode filesystem on a VM with 300MB memory
    did the trick here) and run bulkstat in parallel with other memory
    eating processes to put a huge load on the system. The "check summary"
    phase of xfs_scrub also works for this purpose.
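
    A minimal sketch of the fix, assuming a simplified page-allocation loop
    (names and structure are illustrative, not the actual upstream function):

    static int
    sketch_buf_allocate_pages(struct xfs_buf *bp, xfs_buf_flags_t flags)
    {
            unsigned int    i;

            bp->b_flags |= _XBF_PAGES;      /* b_pages[] now holds pages we own */

            for (i = 0; i < bp->b_page_count; i++) {
                    bp->b_pages[i] = alloc_page(GFP_NOFS);
                    if (bp->b_pages[i])
                            continue;
                    if (!(flags & XBF_READ_AHEAD)) {
                            /* Blocking read: wait for reclaim, then retry. */
                            congestion_wait(BLK_RW_ASYNC, HZ / 50);
                            i--;
                            continue;
                    }

                    /* Readahead is best effort: drop what we collected so far. */
                    while (i-- > 0)
                            __free_page(bp->b_pages[i]);
                    /* The fix: we no longer own b_pages, so say so. */
                    bp->b_flags &= ~_XBF_PAGES;
                    return -ENOMEM;
            }
            return 0;
    }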

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

26 Aug, 2016

1 commit

  • xfs_wait_buftarg() waits for all pending I/O, drains the ioend
    completion workqueue and walks the LRU until all buffers in the cache
    have been released. This is traditionally an unmount operation, but the
    mechanism is also reused during filesystem freeze.

    xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce,
    which is intended more for a shutdown sequence in that it indicates to
    the queue that new operations are not expected once the drain has begun.
    New work jobs after this point result in a WARN_ON_ONCE() and are
    otherwise dropped.

    With filesystem freeze, however, read operations are allowed and can
    proceed during or after the workqueue drain. If such a read occurs
    during the drain sequence, the workqueue infrastructure complains about
    the queued ioend completion work item and drops it on the floor. As a
    result, the buffer remains on the LRU and the freeze never completes.

    Despite the fact that the overall buffer cache cleanup is not necessary
    during freeze, fix up this operation such that it is safe to invoke
    during non-unmount quiesce operations. Replace the drain_workqueue()
    call with flush_workqueue(), which runs a similar serialization on
    pending workqueue jobs without causing new jobs to be dropped. This is
    safe for unmount as unmount independently locks out new operations by
    the time xfs_wait_buftarg() is invoked.
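
    Roughly, the change amounts to the following (heavily trimmed sketch of
    xfs_wait_buftarg(); only the workqueue call is the point here):

    void
    sketch_wait_buftarg(struct xfs_buftarg *btp)
    {
            /* Wait for in-flight async buffer I/O to be submitted for completion. */
            while (percpu_counter_sum(&btp->bt_io_count))
                    delay(100);

            /*
             * Was drain_workqueue(), which WARNs on and drops ioend work queued
             * after the drain begins - e.g. by reads issued during a freeze.
             * flush_workqueue() waits for queued work without rejecting new work.
             */
            flush_workqueue(btp->bt_mount->m_buf_workqueue);

            /* ... then walk the LRU and release the cached buffers ... */
    }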

    cc:
    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     

17 Aug, 2016

1 commit

  • The buffer I/O accounting mechanism tracks async buffers under I/O. As
    an optimization, the buffer I/O count is incremented only once on the
    first async I/O for a given hold cycle of a buffer and decremented once
    the buffer is released to the LRU (or freed).

    xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but
    we have one or two corner cases where a buffer can be submitted for I/O
    multiple times via different methods in a single hold cycle. If an async
    I/O occurs first, the I/O count is incremented. If a sync I/O occurs
    before the hold count drops, XBF_ASYNC is cleared by the time the I/O
    count is decremented.

    Remove the async assert check from xfs_buf_ioacct_dec() as this is a
    perfectly valid scenario. For the purposes of I/O accounting, we really
    only care about the buffer async state at I/O submission time.
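
    After the change, the decrement side looks roughly like this (sketch;
    _XBF_IN_FLIGHT stands in for the internal "already accounted" flag):

    static void
    sketch_buf_ioacct_dec(struct xfs_buf *bp)
    {
            if (!(bp->b_flags & _XBF_IN_FLIGHT))
                    return;
            /*
             * No ASSERT(bp->b_flags & XBF_ASYNC) here: a sync submission in
             * the same hold cycle may have cleared XBF_ASYNC by now.
             */
            bp->b_flags &= ~_XBF_IN_FLIGHT;
            percpu_counter_dec(&bp->b_target->bt_io_count);
    }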

    Discovered-and-analyzed-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

28 Jul, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    other filesystems want to use it too (e.g. gfs2). Now it is fully
    working, reviewed, and ready to be merged and used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
    compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
    rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

20 Jul, 2016

4 commits

  • Dave Chinner
     
  • Newly allocated XFS metadata buffers are added to the LRU once the hold
    count is released, which typically occurs after I/O completion. There is
    no other mechanism at present that tracks the existence or I/O state of
    a new buffer. Further, readahead I/O tends to be submitted
    asynchronously by nature, which means the I/O can remain in flight and
    actually complete long after the calling context is gone. This means
    that file descriptors or any other holds on the filesystem can be
    released, allowing the filesystem to be unmounted while I/O is still in
    flight. When I/O completion occurs, core data structures may have been
    freed, causing completion to run into invalid memory accesses and likely
    to panic.

    This problem is reproduced on XFS via directory readahead. A filesystem
    is mounted, a directory is opened/closed and the filesystem immediately
    unmounted. The open/close cycle triggers a directory readahead that, if
    delayed long enough, runs buffer I/O completion after the unmount has
    completed.

    To address this problem, add a mechanism to track all in-flight,
    asynchronous buffers using per-cpu counters in the buftarg. The buffer
    is accounted on the first I/O submission after the current reference is
    acquired and unaccounted once the buffer is returned to the LRU or
    freed. Update xfs_wait_buftarg() to wait on all in-flight I/O before
    walking the LRU list. Once in-flight I/O has completed and the workqueue
    has drained, all new buffers should have been released onto the LRU.
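
    A condensed sketch of the submission-side accounting described above
    (assuming the counter lives in the buftarg and an internal flag,
    _XBF_IN_FLIGHT here, marks buffers already accounted in this hold
    cycle; the matching decrement happens when the buffer goes back onto
    the LRU or is freed):

    static void
    sketch_buf_ioacct_inc(struct xfs_buf *bp)
    {
            /* Account only async I/O, and only once per hold cycle. */
            if (!(bp->b_flags & XBF_ASYNC) || (bp->b_flags & _XBF_IN_FLIGHT))
                    return;
            bp->b_flags |= _XBF_IN_FLIGHT;
            percpu_counter_inc(&bp->b_target->bt_io_count);
    }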

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The upcoming buftarg I/O accounting mechanism maintains a count of
    all buffers that have undergone I/O in the current hold-release
    cycle. Certain buffers associated with core infrastructure (e.g.,
    the xfs_mount superblock buffer, log buffers) are never released,
    however. This means that accounting I/O submission on such buffers
    elevates the buftarg count indefinitely and could lead to lockup on
    unmount.

    Define a new buffer flag to explicitly exclude buffers from buftarg
    I/O accounting. Set the flag on the superblock and associated log
    buffers.
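
    Sketched shape of the exclusion (the flag name follows the description;
    the bit value and setup site are illustrative):

    #define XBF_NO_IOACCT   (1u << 3)       /* illustrative bit value */

    static bool
    sketch_buf_ioacct_exempt(struct xfs_buf *bp)
    {
            /* Never-released buffers would hold bt_io_count up forever. */
            return bp->b_flags & XBF_NO_IOACCT;
    }

    /* Set on the superblock buffer (and log buffers) when they are created: */
    static void
    sketch_mark_no_ioacct(struct xfs_buf *bp)
    {
            bp->b_flags |= XBF_NO_IOACCT;
    }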

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Fix up a couple places where extra flag manipulation occurs.

    In the first case we clear XBF_ASYNC and then immediately reset it -
    so don't bother clearing in the first place.

    In the second case we are at a point in the function where the buffer
    must already be async, so there is no need to reset it.

    Add consistent spacing around the " | " while we're at it.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

21 Jun, 2016

2 commits


10 Jun, 2016

1 commit


08 Jun, 2016

3 commits


01 Jun, 2016

1 commit

  • When we have a lot of metadata to flush from the AIL, the buffer
    list can get very long. The current submission code tries to batch
    submission to optimise IO order of the metadata (i.e. ascending
    block order) to maximise block layer merging or IO to adjacent
    metadata blocks.

    Unfortunately, the method used can result in long lock times
    occurring as buffers locked early on in the buffer list might not be
    dispatched until the end of the IO list processing. This is because
    sorting does not occur until after the buffer list has been processed
    and the buffers that are going to be submitted are locked. Hence
    when the buffer list is several thousand buffers long, the lock hold
    times before IO dispatch can be significant.

    To fix this, sort the buffer list before we start trying to lock and
    submit buffers. This means we can now submit buffers immediately
    after they are locked, allowing merging to occur immediately on the
    plug and dispatch to occur as quickly as possible. This means there
    is minimal delay between locking the buffer and IO submission
    occurring, hence reducing the worst-case lock hold times seen during
    delayed write buffer IO submission significantly.
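
    A sketch of the reordered submission loop (trimmed; xfs_buf_cmp() is the
    existing block-number comparison helper used with list_sort()):

    static void
    sketch_delwri_submit_buffers(struct list_head *buffer_list)
    {
            struct xfs_buf  *bp, *n;
            struct blk_plug plug;

            /* Sort first, before any buffer is locked. */
            list_sort(NULL, buffer_list, xfs_buf_cmp);

            blk_start_plug(&plug);
            list_for_each_entry_safe(bp, n, buffer_list, b_list) {
                    if (!xfs_buf_trylock(bp))
                            continue;       /* sketch: real code may block or skip */
                    /* Submit immediately so the lock hold time stays short. */
                    xfs_buf_submit(bp);
            }
            blk_finish_plug(&plug);
    }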

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Dave Chinner

    Dave Chinner
     

18 May, 2016

1 commit

  • Reports have surfaced of a lockdep splat complaining about an
    irq-safe -> irq-unsafe locking order in the xfs_buf_bio_end_io() bio
    completion handler. This only occurs when I/O errors are present
    because bp->b_lock is only acquired in this context to protect
    setting an error on the buffer. The problem is that this lock can be
    acquired with the (request_queue) q->queue_lock held. See
    scsi_end_request() or ata_qc_schedule_eh(), for example.

    Replace the locked test/set of b_io_error with a cmpxchg() call.
    This eliminates the need for the lock and thus the lock ordering
    problem goes away.
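
    The resulting pattern is tiny (sketch of the completion-side recording):

    static void
    sketch_record_buf_ioerror(struct xfs_buf *bp, int error)
    {
            /*
             * Record the first error seen across the buffer's bios without
             * taking b_lock, so this is safe with q->queue_lock held.
             */
            cmpxchg(&bp->b_io_error, 0, error);
    }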

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

10 Feb, 2016

1 commit


23 Jan, 2016

1 commit

  • Pull more xfs updates from Dave Chinner:
    "This is the second update for XFS that I mentioned in the original
    pull request last week.

    It contains a revert for a suspend regression in 4.4 and a fix for a
    long standing log recovery issue that has been further exposed by all
    the log recovery changes made in the original 4.5 merge.

    There is one more thing in this pull request - one that I forgot to
    merge into the origin. That is, pulling the XFS_IOC_FS[GS]ETXATTR
    ioctl up to the VFS level so that other filesystems can also use it
    for modifying project quota IDs

    Summary:

    - promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
    it can be shared with other filesystems. The ext4 project quota
    functionality is the first target for this. The commits in this
    series have not been updated with review or final SOB tags because
    the branch they were originally published in was needed by ext4.
    Those tags are:

    Reviewed-by: Theodore Ts'o
    Signed-off-by: Dave Chinner

    - Revert a change that is causing suspend failures.

    - Fix a use-after-free that can occur on log mount failures. Been
    around forever, but now exposed by other changes to log recovery
    made in the first 4.5 merge"

    * tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: log mount failures don't wait for buffers to be released
    Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
    xfs: introduce per-inode DAX enablement
    xfs: use FS_XFLAG definitions directly
    fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion

    Linus Torvalds
     

19 Jan, 2016

2 commits

  • Dave Chinner
     
  • Recently I've been seeing xfs/051 fail on 1k block size filesystems.
    Trying to trace the events during the test led to the problem going
    away, indicating that it was a race condition that led to this
    ASSERT failure:

    XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
    .....
    [] xfs_free_perag+0x87/0xb0
    [] xfs_mountfs+0x4d9/0x900
    [] xfs_fs_fill_super+0x3bf/0x4d0
    [] mount_bdev+0x180/0x1b0
    [] xfs_fs_mount+0x15/0x20
    [] mount_fs+0x38/0x170
    [] vfs_kern_mount+0x67/0x120
    [] do_mount+0x218/0xd60
    [] SyS_mount+0x8b/0xd0

    When I finally caught it with tracing enabled, I saw that AG 2 had
    an elevated reference count and a buffer was responsible for it. I
    tracked down the specific buffer, and found that it was missing the
    final reference count release that would put it back on the LRU and
    hence be found by xfs_wait_buftarg() calls in the log mount failure
    handling.

    The last four traces for the buffer before the assert were (trimmed
    for relevance)

    kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC
    kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5
    mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC
    mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC

    This is an async write that is completing, so there's nobody waiting
    for it directly. Hence we call xfs_buf_relse() once all the
    processing is complete. That does:

    static inline void xfs_buf_relse(xfs_buf_t *bp)
    {
            xfs_buf_unlock(bp);
            xfs_buf_rele(bp);
    }

    Now, it's clear that mount is waiting on the buffer lock, and that
    it has been released by xfs_buf_relse() and gained by mount. This is
    expected, because at this point the mount process is in
    xfs_buf_delwri_submit() waiting for all the IO it submitted to
    complete.

    The mount process, however, is waiting on the lock for the buffer
    because it is in xfs_buf_delwri_submit(). This waits for IO
    completion, but it doesn't wait for the buffer reference owned by
    the IO to go away. The mount process collects all the completions,
    fails the log recovery, and the higher level code then calls
    xfs_wait_buftarg() to free all the remaining buffers in the
    filesystem.

    The issue is that on unlocking the buffer, the scheduler has decided
    that the mount process has higher priority than the kworker
    thread that is running the IO completion, and so immediately
    switched contexts to the mount process from the semaphore unlock
    code, hence preventing the kworker thread from finishing the IO
    completion and releasing the IO reference to the buffer.

    Hence by the time that xfs_wait_buftarg() is run, the buffer still
    has an active reference and so isn't on the LRU list that the
    function walks to free the remaining buffers. Hence we miss that
    buffer and continue onwards to tear down the mount structures,
    at which time we find a stray reference count on the perag
    structure. On a non-debug kernel, this will be ignored and the
    structure torn down and freed. Hence when the kworker thread is then
    rescheduled and the buffer released and freed, it will access a
    freed perag structure.

    The problem here is that when the log mount fails, we still need to
    quiesce the log to ensure that the IO workqueues have returned to
    idle before we run xfs_wait_buftarg(). By synchronising the
    workqueues, we ensure that all IO completions are fully processed,
    not just to the point where buffers have been unlocked. This ensures
    we don't end up in the situation above.
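
    In outline, the failure path then looks like this (sketch; the helper
    used to synchronise the workqueues is an assumption based on the
    description above):

    static void
    sketch_log_mount_fail_teardown(struct xfs_mount *mp)
    {
            /* Quiesce the log so the I/O completion workqueues return to idle... */
            xfs_log_quiesce(mp);

            /* ...then every buffer is back on the LRU and can be reclaimed. */
            xfs_wait_buftarg(mp->m_ddev_targp);
    }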

    cc: # 3.18
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

14 Jan, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's not a lot in this - the main addition is the CRC validation of
    the entire region of the log that will be recovered, along with
    several log recovery fixes. Most of the rest is small bug fixes and
    cleanups.

    I have three bug fixes still pending, all of which address recently
    fixed regressions; I will send them next week after they've had some
    time in for-next.

    Summary:
    - extensive CRC validation during log recovery
    - several log recovery bug fixes
    - Various DAX support fixes
    - AGFL size calculation fix
    - various cleanups in preparation for new functionality
    - project quota ENOSPC notification via netlink
    - tracing and debug improvements"

    * tag 'xfs-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (26 commits)
    xfs: handle dquot buffer readahead in log recovery correctly
    xfs: inode recovery readahead can race with inode buffer creation
    xfs: eliminate committed arg from xfs_bmap_finish
    xfs: bmapbt checking on debug kernels too expensive
    xfs: add tracepoints to readpage calls
    xfs: debug mode log record crc error injection
    xfs: detect and trim torn writes during log recovery
    xfs: fix recursive splice read locking with DAX
    xfs: Don't use reserved blocks for data blocks with DAX
    XFS: Use a signed return type for suffix_kstrtoint()
    libxfs: refactor short btree block verification
    libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct
    libxfs: use a convenience variable instead of open-coding the fork
    xfs: fix log ticket type printing
    libxfs: make xfs_alloc_fix_freelist non-static
    xfs: make xfs_buf_ioend_async() static
    xfs: send warning of project quota to userspace via netlink
    xfs: get mp from bma->ip in xfs_bmap code
    xfs: print name of verifier if it fails
    libxfs: Optimize the loop for xfs_bitmap_empty
    ...

    Linus Torvalds
     

12 Jan, 2016

1 commit

  • When we do inode readahead in log recovery, we can do the
    readahead before we've replayed the icreate transaction that stamps
    the buffer with inode cores. The inode readahead verifier catches
    this and marks the buffer as !done to indicate that it doesn't yet
    contain valid inodes.

    In adding buffer error notification (i.e. setting b_error = -EIO at
    the same time as we clear the done flag) to such a readahead
    verifier failure, we can then get subsequent inode recovery failing
    with this error:

    XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

    This occurs when readahead completion races with icreate item replay
    such as:

    inode readahead
        find buffer
        lock buffer
        submit RA io
        ....
                                    icreate recovery
                                        xfs_trans_get_buffer
                                        find buffer
                                        lock buffer
        .....
        fails verifier
        clear XBF_DONE
        set bp->b_error = -EIO
        release and unlock buffer
                                        icreate initialises buffer
                                        marks buffer as done
                                        adds buffer to delayed write queue
                                        releases buffer

    At this point, we have an initialised inode buffer that is up to
    date but has an -EIO state registered against it. When we finally
    get to recovering an inode in that buffer:

    inode item recovery
    xfs_trans_read_buffer
    find buffer
    lock buffer
    sees XBF_DONE is set, returns buffer
    sees bp->b_error is set
    fail log recovery!

    Essentially, we need xfs_trans_get_buf_map() to clear the error status of
    the buffer when doing a lookup. This function returns uninitialised
    buffers, so the buffer returned can not be in an error state and
    none of the code that uses this function expects b_error to be set
    on return. Indeed, there is an ASSERT(!bp->b_error); in the
    transaction case in xfs_trans_get_buf_map() that would have caught
    this if log recovery used transactions....

    This patch firstly changes the inode readahead failure to set -EIO
    on the buffer, and secondly changes xfs_buf_get_map() to never
    return a buffer with an error state set so this first change doesn't
    cause unexpected log recovery failures.
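
    A sketch of the lookup-side change (simplified; error handling and the
    allocation of new buffers are omitted):

    struct xfs_buf *
    sketch_buf_get_map(struct xfs_buftarg *target, struct xfs_buf_map *map,
                       int nmaps, xfs_buf_flags_t flags)
    {
            struct xfs_buf  *bp;

            bp = _xfs_buf_find(target, map, nmaps, flags, NULL);
            if (!bp)
                    return NULL;

            /*
             * "Get" returns an uninitialised buffer to the caller, so any
             * stale error left by a failed readahead must not leak out.
             */
            if (bp->b_error)
                    xfs_buf_ioerror(bp, 0);

            return bp;
    }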

    cc: # 3.12 - current
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

07 Jan, 2016

1 commit


04 Jan, 2016

1 commit


12 Oct, 2015

3 commits

  • Dave Chinner
     
  • This patch modifies the stats counting macros and the callers
    to those macros to properly increment, decrement, and add-to
    the xfs stats counts. The counts for global and per-fs stats
    are correctly advanced, and cleared by writing a "1" to the
    corresponding clear file.

    global counts: /sys/fs/xfs/stats/stats
    per-fs counts: /sys/fs/xfs/sda*/stats/stats

    global clear: /sys/fs/xfs/stats/stats_clear
    per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear

    [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
    stats structures and macros. ]

    Signed-off-by: Bill O'Donnell
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Bill O'Donnell
     
  • This patch adds comm name and pid to warning messages printed by
    kmem_alloc(), kmem_zone_alloc() and xfs_buf_allocate_memory().
    This will help tell which tasks' memory allocations (e.g. from kernel
    worker threads, OOM victim tasks, or neither) are stalling, because
    these functions pass __GFP_NOWARN, which suppresses not only the
    backtrace but also the comm name and pid.
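
    The reported context looks roughly like this (sketch of the retry loop
    in kmem_alloc(); the message text is approximate):

    void *
    sketch_kmem_alloc(size_t size)
    {
            void    *ptr;

            do {
                    ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
                    if (ptr)
                            return ptr;
                    /* __GFP_NOWARN hides the caller, so name it ourselves. */
                    xfs_err(NULL,
            "%s(%u) possible memory allocation deadlock size %zu in %s",
                            current->comm, current->pid, size, __func__);
                    congestion_wait(BLK_RW_ASYNC, HZ / 50);
            } while (1);
    }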

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Tetsuo Handa
     

08 Sep, 2015

1 commit

  • Pull xfs updates from Dave Chinner:
    "There isn't a whole lot to this update - it's mostly bug fixes and
    they are spread pretty much all over XFS. There are some corruption
    fixes, some fixes for log recovery, some fixes that prevent unmount
    from hanging, a lockdep annotation rework for inode locking to prevent
    false positives and the usual random bunch of cleanups and minor
    improvements.

    Details:

    - large rework of EFI/EFD lifecycle handling to fix log recovery
    corruption issues, crashes and unmount hangs

    - separate metadata UUID on disk to enable changing boot label UUID
    for v5 filesystems

    - fixes for gcc miscompilation on certain platforms and optimisation
    levels

    - remote attribute allocation and recovery corruption fixes

    - inode lockdep annotation rework to fix bugs with too many
    subclasses

    - directory inode locking changes to prevent lockdep false positives

    - a handful of minor corruption fixes

    - various other small cleanups and bug fixes"

    * tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
    xfs: fix error gotos in xfs_setattr_nonsize
    xfs: add mssing inode cache attempts counter increment
    xfs: return errors from partial I/O failures to files
    libxfs: bad magic number should set da block buffer error
    xfs: fix non-debug build warnings
    xfs: collapse allocsize and biosize mount option handling
    xfs: Fix file type directory corruption for btree directories
    xfs: lockdep annotations throw warnings on non-debug builds
    xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()
    xfs: inode lockdep annotations broke non-lockdep build
    xfs: flush entire file on dio read/write to cached file
    xfs: Fix xfs_attr_leafblock definition
    libxfs: readahead of dir3 data blocks should use the read verifier
    xfs: stop holding ILOCK over filldir callbacks
    xfs: clean up inode lockdep annotations
    xfs: swap leaf buffer into path struct atomically during path shift
    xfs: relocate sparse inode mount warning
    xfs: dquots should be stamped with sb_meta_uuid
    xfs: log recovery needs to validate against sb_meta_uuid
    xfs: growfs not aware of sb_meta_uuid
    ...

    Linus Torvalds
     

25 Aug, 2015

2 commits


29 Jul, 2015

2 commits

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and are not passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.
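
    A sketch of what a completion handler looks like with the new field
    (using XFS's buffer completion as the example; signatures follow the
    new single-argument ->bi_end_io convention):

    static void
    sketch_buf_bio_end_io(struct bio *bio)
    {
            struct xfs_buf  *bp = bio->bi_private;

            /* One field, one meaning: a Linux errno, 0 on success. */
            if (bio->bi_error)
                    cmpxchg(&bp->b_io_error, 0, bio->bi_error);

            /* ... normal buffer completion handling ... */
            bio_put(bio);
    }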

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The second and subsequent lines of multi-line logging messages
    are not prefixed with the same information as the first line.

    Split messages that contain newlines into multiple calls to ensure
    consistent prefixing and allow easier grep use.
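
    For example (sketch; message text illustrative):

    static void
    sketch_report_corruption(struct xfs_mount *mp, xfs_daddr_t bno, int len)
    {
            /* One xfs_warn() per line keeps the "XFS (<dev>):" prefix on both. */
            xfs_warn(mp, "Metadata corruption detected, unmount and run xfs_repair");
            xfs_warn(mp, "First bad block 0x%llx, length %d",
                     (unsigned long long)bno, len);
    }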

    Signed-off-by: Joe Perches
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Joe Perches
     

22 Jun, 2015

1 commit


13 Feb, 2015

2 commits

  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
    __list_lru_walk_one after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.
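
    Sketch of an isolate callback using the new helper (the surrounding
    shrinker is omitted):

    static enum lru_status
    sketch_lru_isolate(struct list_head *item, struct list_lru_one *lru,
                       spinlock_t *lru_lock, void *arg)
    {
            struct list_head *dispose = arg;

            /*
             * Was list_move(item, dispose), with nr_items fixed up later by
             * __list_lru_walk_one(); the helper keeps nr_items in sync here.
             */
            list_lru_isolate_move(lru, item, dispose);
            return LRU_REMOVED;
    }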

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Kmem accounting of memcg is unusable now, because it lacks slab shrinker
    support. That means when we hit the limit we will get ENOMEM w/o any
    chance to recover. What we should do then is to call shrink_slab, which
    would reclaim old inode/dentry caches from this cgroup. This is what
    this patch set is intended to do.

    Basically, it does two things. First, it introduces the notion of
    per-memcg slab shrinker. A shrinker that wants to reclaim objects per
    cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
    passed the memory cgroup to scan from in shrink_control->memcg. For
    such shrinkers shrink_slab iterates over the whole cgroup subtree under
    the target cgroup and calls the shrinker for each kmem-active memory
    cgroup.

    Secondly, this patch set makes the list_lru structure per-memcg. It's
    done transparently to list_lru users - everything they have to do is to
    tell list_lru_init that they want memcg-aware list_lru. Then the
    list_lru will automatically distribute objects among per-memcg lists
    based on which cgroup the object is accounted to. This way, to make FS
    shrinkers (icache, dcache) memcg-aware, we only need to make them use a
    memcg-aware list_lru, and this is what this patch set does.

    As before, this patch set only enables per-memcg kmem reclaim when the
    pressure goes from memory.limit, not from memory.kmem.limit. Handling
    memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
    it is still unclear whether we will have this knob in the unified
    hierarchy.

    This patch (of 9):

    NUMA aware slab shrinkers use the list_lru structure to distribute
    objects coming from different NUMA nodes to different lists. Whenever
    such a shrinker needs to count or scan objects from a particular node,
    it issues commands like this:

    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                               isolate_arg, &sc->nr_to_scan);

    where sc is an instance of the shrink_control structure passed to it
    from vmscan.

    To simplify this, let's add special list_lru functions to be used by
    shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
    consolidate the nid and nr_to_scan arguments in the shrink_control
    structure.

    This will also allow us to avoid patching shrinkers that use list_lru
    when we make shrink_slab() per-memcg - all we will have to do is extend
    the shrink_control structure to include the target memcg and make
    list_lru_shrink_{count,walk} handle this appropriately.
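
    With the new helpers, the same scan collapses to something like this
    (sketch; the isolate callback has the same shape as the one sketched in
    the previous entry):

    static unsigned long
    sketch_shrink_scan(struct list_lru *lru, struct shrink_control *sc)
    {
            LIST_HEAD(dispose);
            unsigned long freed;

            /* nid and nr_to_scan now travel inside sc rather than as arguments. */
            freed = list_lru_shrink_walk(lru, sc, sketch_lru_isolate, &dispose);
            /* ... dispose of the isolated objects ... */
            return freed;
    }

    static unsigned long
    sketch_shrink_count(struct list_lru *lru, struct shrink_control *sc)
    {
            return list_lru_shrink_count(lru, sc);
    }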

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Dec, 2014

2 commits

  • Conflicts:
    fs/xfs/xfs_iops.c

    Dave Chinner
     
  • XFS traditionally sends all buffer I/O completion work to a single
    workqueue. This includes metadata buffer completion and log buffer
    completion. The log buffer completion requires a high priority queue to
    prevent stalls due to log forces getting stuck behind other queued work.

    Rather than continue to prioritize all buffer I/O completion due to the
    needs of log completion, split log buffer completion off to
    m_log_workqueue and move the high priority flag from m_buf_workqueue to
    m_log_workqueue.

    Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue
    customization on a per-buffer basis. Initialize b_ioend_wq to
    m_buf_workqueue by default in the generic buffer I/O submission path.
    Finally, override the default wq with the high priority m_log_workqueue
    in the log buffer I/O submission path.
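
    Condensed, the mechanism looks like this (sketch; the two setup sites
    are folded into one function for illustration):

    static void
    sketch_buf_ioend_wq_init(struct xfs_buf *bp, bool is_log_buffer)
    {
            struct xfs_mount *mp = bp->b_target->bt_mount;

            /* Default: ordinary metadata completion on m_buf_workqueue. */
            bp->b_ioend_wq = mp->m_buf_workqueue;

            /* Log buffers get the high-priority queue to avoid log-force stalls. */
            if (is_log_buffer)
                    bp->b_ioend_wq = mp->m_log_workqueue;
    }

    /* At I/O completion, the work item is queued on the chosen workqueue: */
    static void
    sketch_buf_ioend_async(struct xfs_buf *bp)
    {
            INIT_WORK(&bp->b_ioend_work, xfs_buf_ioend_work);
            queue_work(bp->b_ioend_wq, &bp->b_ioend_work);
    }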

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

28 Nov, 2014

3 commits