12 Oct, 2016

1 commit

  • The mapping_set_error() helper sets the correct AS_ flag for the mapping
    so there is no reason to open code it. Use the helper directly.

    [akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
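
    For reference, a sketch of the helper as it looked in
    include/linux/pagemap.h at the time (reproduced from memory, so treat
    it as illustrative):

    static inline void mapping_set_error(struct address_space *mapping,
                                         int error)
    {
            if (unlikely(error)) {
                    if (error == -ENOSPC)
                            set_bit(AS_ENOSPC, &mapping->flags);
                    else
                            set_bit(AS_EIO, &mapping->flags);
            }
    }

    Any error other than -ENOSPC (including -ENXIO) maps to AS_EIO, hence
    the conversion noted above.
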
    Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

28 Sep, 2016

1 commit

  • __getblk_slow() was exported to modules in commit 3b5e6454aaf6
    ("fs/buffer.c: support buffer cache allocations with gfp modifiers").
    This seems to have been a mistake, as no users were introduced nor was
    the function declared in a header. Change it back to 'static'.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

28 Jul, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    there are other filesystems that want to use it too (e.g. gfs2). Now it
    is fully working, reviewed, and ready to be merged and used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
    compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
    rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a follow-up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

21 Jul, 2016

1 commit

  • These two are confusing leftovers of the old world order, combining
    values of the REQ_OP_ and REQ_ namespaces. For callers that don't
    special-case, we mostly just replace bi_rw with bio_data_dir or
    op_is_write, except for the few cases where a switch over the REQ_OP_
    values makes more sense. Any check for READA is replaced with an
    explicit check for REQ_RAHEAD. Also remove the READA alias for
    REQ_RAHEAD.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 Jun, 2016

1 commit

  • gfs2 needs to be able to skip the check to see if a page is outside of
    the file size when writing it out. gfs2 can get into a situation where
    it needs to flush its in-memory log to disk while a truncate is in
    progress. If the file being truncated has data journaling enabled, it is
    possible that there are data blocks in the log that are past the end of
    the file. gfs2 can't finish the log flush without either writing these
    blocks out or revoking them. Otherwise, if the node crashed, it could
    overwrite subsequent changes made by other nodes in the cluster when
    its journal was replayed.

    Unfortunately, there is no way to add log entries to the log during a
    flush. So gfs2 simply writes out the page instead. This situation can
    only occur when the truncate code still has the file locked exclusively,
    and hasn't marked this block as free in the metadata (which happens
    later in trunc_dealloc). After gfs2 writes this page out, the truncation
    code will shortly invalidate it and write out any revokes if necessary.

    In order to make this work, gfs2 needs to be able to skip the check for
    writes outside the file size. Since the check exists in
    block_write_full_page, this patch exports __block_write_full_page, which
    doesn't have the check.
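
    As a hedged illustration (not necessarily the exact gfs2 hunk), a
    caller that must write such a page can bypass the size check roughly
    like this, with gfs2_get_block_noalloc standing in for the
    fs-specific get_block callback:

    /* block_write_full_page() would skip pages beyond i_size;
     * call the newly exported variant directly instead. */
    return __block_write_full_page(inode, page, gfs2_get_block_noalloc,
                                   wbc, end_buffer_async_write);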

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     

21 Jun, 2016

1 commit

  • Add infrastructure for multipage buffered writes. This is implemented
    using a main iterator that applies an actor function to a range that
    can be written.

    This infrastructure is used to implement a buffered write helper, one
    to zero file ranges and one to implement the ->page_mkwrite VM
    operation. All of them borrow a fair amount of code from fs/buffer.c
    for now by using an internal version of __block_write_begin that
    gets passed an iomap and builds the corresponding buffer head.

    The file system gets a set of paired ->iomap_begin and ->iomap_end
    calls which allow it to map/reserve a range and get a notification
    once the write code is finished with it.

    Based on earlier code from Dave Chinner.
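
    A condensed sketch of the iterator pattern described above
    (reconstructed from the shape of the generic code; locking and most
    error handling elided):

    static loff_t
    iomap_apply(struct inode *inode, loff_t pos, loff_t length,
                unsigned flags, struct iomap_ops *ops, void *data,
                iomap_actor_t actor)
    {
            struct iomap iomap = { 0 };
            loff_t written = 0, ret;

            /* ask the filesystem to map/reserve the range */
            ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
            if (ret)
                    return ret;

            /* apply the actor (buffered write, zeroing, ...) */
            written = actor(inode, pos, length, data, &iomap);

            /* tell the filesystem the write code is done with the range */
            if (ops->iomap_end)
                    ret = ops->iomap_end(inode, pos, length,
                                         written > 0 ? written : 0,
                                         flags, &iomap);

            return written ? written : ret;
    }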

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bob Peterson
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

20 May, 2016

1 commit

  • The allocator fast path looks up the first usable zone in a zonelist and
    then get_page_from_freelist does the same job in the zonelist iterator.
    This patch preserves the necessary information from the first lookup so
    that it does not have to be repeated.

    4.6.0-rc2 4.6.0-rc2
    fastmark-v1r20 initonce-v1r20
    Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
    Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
    Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
    Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
    Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
    Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
    Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
    Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
    Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
    Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
    Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
    Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
    Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
    Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
    Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)

    The benefit is negligible and the results are within the noise but each
    cycle counts.
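
    A rough sketch of the idea, with names assumed from the allocator
    code of that era (treat this as illustrative, not the exact hunk):

    /* do the zonelist lookup once and cache the result in the
     * alloc_context instead of repeating it in the iterator */
    ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
                                ac.high_zoneidx, ac.nodemask);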

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

2 commits

  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, and these locking sites,
    now easy to identify, along with it.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Nov, 2015

1 commit

  • The function currently called "__block_page_mkwrite()" used to be called
    "block_page_mkwrite()" until a wrapper for this function was added by:

    commit 24da4fab5a61 ("vfs: Create __block_page_mkwrite() helper passing
    error values back")

    This wrapper, the current "block_page_mkwrite()", is currently unused.
    __block_page_mkwrite() is used directly by ext4, nilfs2 and xfs.

    Remove the unused wrapper, rename __block_page_mkwrite() back to
    block_page_mkwrite() and update the comment above block_page_mkwrite().

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Al Viro

    Ross Zwisler
     

07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask which would be used for allocations that are not
    directly related to the page cache but are performed in the same
    context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.
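
    The helper is a one-liner, roughly as added to
    include/linux/pagemap.h:

    static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                               gfp_t gfp_mask)
    {
            /* restrict the given mask by the mapping's own gfp mask */
            return mapping_gfp_mask(mapping) & gfp_mask;
    }

    so e.g. mapping_gfp_constraint(mapping, ~__GFP_FS) replaces the open
    coded mapping_gfp_mask(mapping) & ~__GFP_FS.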

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

14 Aug, 2015

1 commit

  • Call pre-defined helper bio_add_page() instead of open coding for
    iterating through bi_io_vec[]. Doing that, it's possible to make some
    parts in filesystems and mm/page_io.c simpler than before.

    Acked-by: Dave Kleikamp
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Kent Overstreet
    [dpark: add more description in commit message]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

29 Jul, 2015

2 commits

  • Some places use helpers now, others don't. We only have the 'is set'
    helper, so add helpers for setting and clearing flags too.

    It was a bit of a mess of atomic vs non-atomic access. With
    BIO_UPTODATE gone, we don't have any risk of concurrent access to the
    flags. So relax the restriction and don't make any of them atomic. The
    flags that do have serialization issues (reffed and chained) are
    already handled separately.
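
    A sketch of the new helpers as plain non-atomic bit operations
    (reproduced from memory; treat as illustrative):

    static inline bool bio_flagged(struct bio *bio, unsigned int bit)
    {
            return (bio->bi_flags & (1U << bit)) != 0;
    }

    static inline void bio_set_flag(struct bio *bio, unsigned int bit)
    {
            bio->bi_flags |= (1U << bit);
    }

    static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
    {
            bio->bi_flags &= ~(1U << bit);
    }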

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and of not being passed along from child to
    parent bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.
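
    After this change, error signalling follows one convention; a sketch
    (my_end_io and handle_error are hypothetical names):

    /* completion path in a driver */
    bio->bi_error = -EIO;   /* store the errno directly in the bio */
    bio_endio(bio);         /* no error argument anymore */

    /* consumption in a bi_end_io callback */
    static void my_end_io(struct bio *bio)
    {
            if (bio->bi_error)
                    handle_error(bio->bi_error);
    }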

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Jun, 2015

6 commits

  • Merge hiccup on my part, due to a clash between the writeback
    changes and the EOPNOTSUPP removal in _submit_bh().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • As concurrent write sharing of an inode is expected to be very rare,
    and memcg only tracks page ownership on a first-use basis, severely
    confining the usefulness of such sharing, cgroup writeback tracks
    ownership per-inode. While the support for concurrent write sharing
    of an inode is deemed unnecessary, an inode being written to by
    different cgroups at different points in time is a lot more common,
    and, more importantly, charging only by first use can too readily lead
    to grossly incorrect behaviors (a single foreign page can lead to
    gigabytes of writeback being incorrectly attributed).

    To resolve this issue, cgroup writeback detects the majority dirtier
    of an inode and will transfer the ownership to it. To avoid
    unnecessary oscillation, the detection mechanism keeps track of
    history and gives out the switch verdict only if the foreign usage
    pattern is stable over a certain amount of time and/or writeback
    attempts.

    The detection mechanism has fairly low space and computation overhead.
    It adds 8 bytes to struct inode (one int and two u16's) and a minimal
    amount of calculation per IO. The detection mechanism converges to
    the correct answer usually in several seconds of IO time when there's
    a clear majority dirtier. Even when there isn't, it can reach an
    acceptable answer fairly quickly under most circumstances.

    Please see wb_detach_inode() for more details.

    This patch only implements detection. Following patches will
    implement actual switching.

    v2: wbc_account_io() now checks whether the wbc is associated with a
    wb before dereferencing it. This can happen when pageout() is
    writing pages directly without going through the usual writeback
    path. As the pageout() path is single-threaded, we don't want it to
    be blocked behind a slow cgroup, and we ultimately want it to delegate
    the actual writing to the usual writeback path.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, for cgroup writeback, the IO submission paths directly
    associate the bio's with the blkcg from inode_to_wb_blkcg_css();
    however, it'd be necessary to keep more writeback context to implement
    foreign inode writeback detection. wbc (writeback_control) is the
    natural fit for the extra context - it persists throughout the
    writeback of each inode and is passed all the way down to IO
    submission paths.

    This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
    wbc_attach_fdatawrite_inode() which are used to associate wbc with the
    inode being written back. IO submission paths now use wbc_init_bio()
    instead of directly associating bio's with blkcg themselves. This
    leaves inode_to_wb_blkcg_css() w/o any user. The function is removed.

    wbc currently only tracks the associated wb (bdi_writeback). Future
    patches will add more for foreign inode detection. The association is
    established under i_lock which will be depended upon when migrating
    foreign inodes to other wb's.

    Since, currently, the inode-to-wb association never changes once
    established, going through wbc when initializing bio's doesn't cause
    any behavior changes.

    v2: submit_blk_blkcg() now checks whether the wbc is associated with a
    wb before dereferencing it. This can happen when pageout() is
    writing pages directly without going through the usual writeback
    path. As the pageout() path is single-threaded, we don't want it to
    be blocked behind a slow cgroup, and we ultimately want it to delegate
    the actual writing to the usual writeback path.
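
    The intended usage pattern, sketched (wbc is a pointer to the
    writeback_control being used for this inode):

    spin_lock(&inode->i_lock);
    wbc_attach_and_unlock_inode(wbc, inode); /* associate, drops i_lock */

    /* ... down in the IO submission path ... */
    wbc_init_bio(wbc, bio);  /* instead of associating blkcg directly */

    /* ... once writeback of the inode is finished ... */
    wbc_detach_inode(wbc);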

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • [__]block_write_full_page() is used to implement ->writepage in
    various filesystems. All writeback logic is now updated to handle
    cgroup writeback and the block cgroup to issue IOs for is encoded in
    writeback_control and can be retrieved from the inode; however,
    [__]block_write_full_page() currently ignores the blkcg indicated by
    inode and issues all bio's without explicit blkcg association.

    This patch adds submit_bh_blkcg() which associates the bio with the
    specified blkio cgroup before issuing and uses it in
    __block_write_full_page() so that the issued bio's are associated with
    inode_to_wb_blkcg_css(inode).

    v2: Updated for per-inode wb association.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When modifying PG_Dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:

        memcg = mem_cgroup_begin_page_stat(page)
        if (TestSetPageDirty()) {
                [...]
                mem_cgroup_update_page_stat(memcg)
        }
        mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
    rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
    rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat(), which returns the memcg later
    needed by mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
    all metrics, they're all wall clock or cycle counts. The read and write
    fault benchmarks just measure fault time, they do not include I/O time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen
     
  • cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
    clean up mess around cancel_dirty_page()") replaced it with
    account_page_cleaned() which makes the caller responsible for clearing
    the dirty bit; unfortunately, the planned changes for cgroup writeback
    support require synchronization between dirty bit manipulation and
    stat updates. While we can open-code such synchronization in each
    account_page_cleaned() callsite, that's going to be unnecessarily
    awkward and verbose.

    This patch revives cancel_dirty_page() but in a more restricted form.
    All it does is TestClearPageDirty() followed by account_page_cleaned()
    invocation if the page was dirty. This helper covers all
    account_page_cleaned() usages except for __delete_from_page_cache()
    which is a special case anyway and left alone. As this leaves no
    module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
    from it.

    This patch just revives cancel_dirty_page() as a trivial wrapper to
    replace equivalent usages and doesn't introduce any functional
    changes.
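
    Per the description above, the revived helper is essentially (a
    sketch, not the verbatim hunk):

    void cancel_dirty_page(struct page *page)
    {
            /* clear the dirty bit; fix up counters only if it was set */
            if (TestClearPageDirty(page))
                    account_page_cleaned(page, page_mapping(page));
    }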

    Signed-off-by: Tejun Heo
    Cc: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 May, 2015

1 commit

  • Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
    FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
    used to propagate those to the buffer_head layer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Apr, 2015

1 commit

  • This patch replaces cancel_dirty_page() with a helper function
    account_page_cleaned() which only updates counters. It's called from
    truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
    Page is locked in both cases, page-lock protects against concurrent
    dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from
    ongoing truncation").

    delete_from_page_cache() shouldn't be called for dirty pages; they must
    be handled by the caller (either written or truncated). This patch treats
    the final dirty accounting fixup at the end of __delete_from_page_cache()
    as a debug check and adds WARN_ON_ONCE() around it. If something removes
    dirty pages without proper handling, that might be a bug, and unwritten
    data might be lost.

    Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
    here.

    cancel_dirty_page() in nfs_wb_page_cancel() is redundant. It is a
    helper for nfs_invalidate_page() and is called only in the case of
    complete invalidation.

    The mess started in v2.6.20 with commits 46d2277c796f ("Clean up
    and make try_to_free_buffers() not race with dirty pages") and
    3e67c0987d75 ("truncate: clear page dirtiness before running
    try_to_free_buffers()"); the first was reverted in v2.6.20 itself by
    commit ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"),
    the second in v2.6.25 by commit a2b345642f53 ("Fix dirty page
    accounting leak with ext3 data=journal").

    Custom fixes were introduced between these points: NFS in v2.6.23 with
    commit 1b3b4a1a2deb ("NFS: Fix a write request leak in
    nfs_invalidate_page()"), and a kludge in __delete_from_page_cache() in
    v2.6.24 with commit 3a6927906f1b ("Do dirty page accounting when
    removing a page from the page cache"). Since v2.6.25 all of them are
    redundant.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

22 Oct, 2014

2 commits

  • When quiet_error applies rate limiting to buffer_io_error calls, what
    they apply to is unclear because the name is so generic, particularly
    if the messages are interleaved with others:

    [ 1936.063572] quiet_error: 664293 callbacks suppressed
    [ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write
    [ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write

    Also, the function uses printk_ratelimit(), although printk.h includes a
    comment advising "Please don't use... Instead use printk_ratelimited()."

    Change buffer_io_error to check the BH_Quiet bit itself, drop the
    printk_ratelimit call, and print using printk_ratelimited.

    This makes the messages look like:

    [ 387.208839] buffer_io_error: 676394 callbacks suppressed
    [ 387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write
    [ 387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write
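
    The reworked function looks roughly like this (a sketch from the
    description; the format string matches the examples above):

    static void buffer_io_error(struct buffer_head *bh, char *msg)
    {
            char b[BDEVNAME_SIZE];

            /* BH_Quiet is now checked here, not in a quiet_error() helper */
            if (!test_bit(BH_Quiet, &bh->b_state))
                    printk_ratelimited(KERN_ERR
                            "Buffer I/O error on dev %s, logical block %llu%s\n",
                            bdevname(bh->b_bdev, b),
                            (unsigned long long)bh->b_blocknr, msg);
    }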

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     
  • buffer.c uses two printk calls to print these messages:
    [67353.422338] Buffer I/O error on device sdr, logical block 212868488
    [67353.422338] lost page write due to I/O error on sdr

    In a busy system, they may be interleaved with other prints,
    losing the context for the second message. Merge them into
    one line with one printk call so the prints are atomic.

    Also, differentiate between async page writes, sync page writes, and
    async page reads.

    Also, shorten "device" to "dev" to match the block layer prints:
    [67353.467906] blk_update_request: critical target error, dev sdr, sector
    1707107328

    Also, use %llu rather than %Lu.

    Resulting prints look like:
    [ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992
    [ 1361.383522] quiet_error: 659876 callbacks suppressed
    [ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write
    [ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     

21 Oct, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A large number of cleanups and bug fixes, with some (minor) journal
    optimizations"

    [ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ]

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
    ext4: check s_chksum_driver when looking for bg csum presence
    ext4: move error report out of atomic context in ext4_init_block_bitmap()
    ext4: Replace open coded mdata csum feature to helper function
    ext4: delete useless comments about ext4_move_extents
    ext4: fix reservation overflow in ext4_da_write_begin
    ext4: add ext4_iget_normal() which is to be used for dir tree lookups
    ext4: don't orphan or truncate the boot loader inode
    ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
    ext4: optimize block allocation on grow indepth
    ext4: get rid of code duplication
    ext4: fix over-defensive complaint after journal abort
    ext4: fix return value of ext4_do_update_inode
    ext4: fix mmap data corruption when blocksize < pagesize
    vfs: fix data corruption when blocksize < pagesize for mmaped data
    ext4: fold ext4_nojournal_sops into ext4_sops
    ext4: support freezing ext2 (nojournal) file systems
    ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
    ext4: don't check quota format when there are no quota files
    jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
    jbd2: avoid pointless scanning of checkpoint lists
    ...

    Linus Torvalds
     

14 Oct, 2014

1 commit

  • It's very common for the buffer heads in the lru to have different block
    numbers. By comparing the blocknr before the bdev and size we can
    reduce the cost of searching in the very common case where all the
    entries have the same bdev and size.

    In quick hot cache cycle counting tests on a single fs workstation this
    cut the cost of a miss by about 20%.

    A diff of the disassembly shows the reordering of the bdev and blocknr
    comparisons. This is in such a tiny loop that skipping one comparison
    is a meaningful portion of the total work being done:

    1628: 83 c1 01 add $0x1,%ecx
    162b: 83 f9 08 cmp $0x8,%ecx
    162e: 74 60 je 1690
    1630: 89 c8 mov %ecx,%eax
    1632: 65 4c 8b 04 c5 00 00 mov %gs:0x0(,%rax,8),%r8
    1639: 00 00
    163b: 4d 85 c0 test %r8,%r8
    163e: 4c 89 c3 mov %r8,%rbx
    1641: 74 e5 je 1628
    - 1643: 4d 3b 68 30 cmp 0x30(%r8),%r13
    + 1643: 4d 3b 68 18 cmp 0x18(%r8),%r13
    1647: 75 df jne 1628
    - 1649: 4d 3b 60 18 cmp 0x18(%r8),%r12
    + 1649: 4d 3b 60 30 cmp 0x30(%r8),%r12
    164d: 75 d9 jne 1628
    164f: 49 39 50 20 cmp %rdx,0x20(%r8)
    1653: 75 d3 jne 1628
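
    In C terms, the reordered check in the LRU lookup amounts to (a
    sketch of the condition, not the verbatim hunk):

    /* b_blocknr differs most often, so test it before b_bdev and b_size */
    if (bh && bh->b_blocknr == block &&
        bh->b_bdev == bdev && bh->b_size == size) {
            /* LRU hit */
    }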

    Signed-off-by: Zach Brown
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     

13 Oct, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The big thing in this pile is Eric's unmount-on-rmdir series; we
    finally have everything we need for that. The final piece of prereqs
    is delayed mntput() - now filesystem shutdown always happens on
    shallow stack.

    Other than that, we have several new primitives for iov_iter (Matt
    Wilcox, culled from his XIP-related series) pushing the conversion to
    ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
    cleanups and fixes (including the external name refcounting, which
    gives consistent behaviour of d_move() wrt procfs symlinks for long
    and short names alike) and assorted cleanups and fixes all over the
    place.

    This is just the first pile; there's a lot of stuff from various
    people that ought to go in this window. Starting with
    unionmount/overlayfs mess... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
    fs/file_table.c: Update alloc_file() comment
    vfs: Deduplicate code shared by xattr system calls operating on paths
    reiserfs: remove pointless forward declaration of struct nameidata
    don't need that forward declaration of struct nameidata in dcache.h anymore
    take dname_external() into fs/dcache.c
    let path_init() failures treated the same way as subsequent link_path_walk()
    fix misuses of f_count() in ppp and netlink
    ncpfs: use list_for_each_entry() for d_subdirs walk
    vfs: move getname() from callers to do_mount()
    gfs2_atomic_open(): skip lookups on hashed dentry
    [infiniband] remove pointless assignments
    gadgetfs: saner API for gadgetfs_create_file()
    f_fs: saner API for ffs_sb_create_file()
    jfs: don't hash direct inode
    [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
    ecryptfs: ->f_op is never NULL
    android: ->f_op is never NULL
    nouveau: __iomem misannotations
    missing annotation in fs/file.c
    fs: namespace: suppress 'may be used uninitialized' warnings
    ...

    Linus Torvalds
     

10 Oct, 2014

3 commits

  • Increase the buffer-head per-CPU LRU size to allow efficient filesystem
    operations that access many blocks for each transaction. For example,
    creating a file in a large ext4 directory with quota enabled will access
    multiple buffer heads and will overflow the LRU at the default 8-block LRU
    size:

    * parent directory inode table block (ctime, nlinks for subdirs)
    * new inode bitmap
    * inode table block
    * 2 quota blocks
    * directory leaf block (not reused, but pollutes one cache entry)
    * 2 levels htree blocks (only one is reused, other pollutes cache)
    * 2 levels indirect/index blocks (only one is reused)

    The buffer-head per-CPU LRU size is raised to 16, as metadata
    performance benchmarks show gains of up to 10% for create, 4% for
    lookup and 7% for destroy.
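
    The change itself is essentially a one-line constant bump in
    fs/buffer.c:

    #define BH_LRU_SIZE     16      /* raised from 8 */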

    Signed-off-by: Liang Zhen
    Signed-off-by: Andreas Dilger
    Signed-off-by: Sebastien Buisson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastien Buisson
     
  • Add guard_bio_eod() check for mpage code in order to allow us to do IO
    even on the odd last sectors of a device, even if the block size is some
    multiple of the physical sector size.

    Using mpage_readpages() for block devices requires this guard check.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patchset implements readpages() operation for block device by using
    mpage_readpages() which can create multipage BIOs instead of BIOs for each
    page and reduce system CPU time consumption.

    This patch (of 3):

    guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
    last sectors of a device, even if the block size is some multiple of the
    physical sector size. This makes guard_bh_eod() more generic and renames
    it guard_bio_eod() so that we can use it without a struct buffer_head
    argument.

    The reason for this change is that using mpage_readpages() for block
    devices requires adding this guard check to the mpage code.
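
    Sketched usage in a submission path after the rename (submit_bio()
    still took rw at the time):

    /* trim the bio so it doesn't run past the end of the device */
    guard_bio_eod(rw, bio);
    submit_bio(rw, bio);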

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

09 Oct, 2014

1 commit

  • This patch makes it possible to kill a process looping in
    cont_expand_zero. A process may spend a lot of time in this function, so
    it is desirable to be able to kill it.

    It happened to me that I wanted to copy a piece of data from the disk to
    a file. By mistake, I used the "seek" parameter to dd instead of "skip".
    Due to the "seek" parameter, dd attempted to extend the file and became
    stuck doing so - the only options were to reset the machine or to wait
    many hours until the filesystem ran out of space and cont_expand_zero
    failed.
    We need this patch to be able to terminate the process.
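
    A sketch of the fix: check for a fatal signal inside the zeroing
    loop, next to the existing dirty-page throttling:

    balance_dirty_pages_ratelimited(mapping);

    if (unlikely(fatal_signal_pending(current))) {
            err = -EINTR;
            goto out;
    }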

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Mikulas Patocka
     

02 Oct, 2014

1 commit

  • ->page_mkwrite() is used by filesystems to allocate blocks under a page
    which is becoming writeably mmapped in some process' address space. This
    allows a filesystem to fail the page fault if there is not enough space
    available, the user exceeds quota, or a similar problem happens, rather
    than silently discarding data later when writepage is called.

    However VFS fails to call ->page_mkwrite() in all the cases where
    filesystems need it when blocksize < pagesize. For example when
    blocksize = 1024, pagesize = 4096 the following is problematic:
    ftruncate(fd, 0);
    pwrite(fd, buf, 1024, 0);
    map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
    map[0] = 'a'; ----> page_mkwrite() for index 0 is called
    ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
    mremap(map, 1024, 10000, 0);
    map[4095] = 'a'; ----> no page_mkwrite() called

    At the moment ->page_mkwrite() is called, the filesystem can allocate
    only one block for the page because i_size == 1024. Otherwise it would
    create blocks beyond i_size, which is generally undesirable. But later,
    at ->writepage() time, we also need to store data at offset 4095, yet we
    don't have a block allocated for it.

    This patch introduces a helper function filesystems can use to have
    ->page_mkwrite() called at all the necessary moments.
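
    Judging by the vfs commit in the ext4 pull above, the helper is
    pagecache_isize_extended(); sketched usage on an extending operation:

    oldsize = inode->i_size;
    i_size_write(inode, newsize);
    /* make ->page_mkwrite() fire again for the now-partial last page */
    pagecache_isize_extended(inode, oldsize, newsize);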

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Jan Kara