12 Nov, 2015

2 commits


11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this, and libaio support will be added soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
     

08 Nov, 2015

3 commits

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
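
    As a rough userspace illustration of the opt-in described above: polling is
    switched on through a per-queue sysfs attribute and then exercised with a
    plain synchronous O_DIRECT read. The device name and the io_poll attribute
    path are assumptions for this sketch, not guaranteed names.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            /* Hypothetical names: adjust the device and sysfs path as needed. */
            const char *poll_attr = "/sys/block/nvme0n1/queue/io_poll";
            const char *dev = "/dev/nvme0n1";

            int sysfd = open(poll_attr, O_WRONLY);
            if (sysfd >= 0) {
                    write(sysfd, "1", 1);   /* opt in to polling on this queue */
                    close(sysfd);
            }

            int fd = open(dev, O_RDONLY | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            void *buf;
            if (posix_memalign(&buf, 4096, 4096))
                    return 1;

            /* A synchronous O_DIRECT read: with polling enabled the kernel can
             * spin on the completion instead of sleeping on the interrupt. */
            ssize_t n = pread(fd, buf, 4096, 0);
            printf("read %zd bytes\n", n);

            free(buf);
            close(fd);
            return 0;
    }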
     
  • Return a cookie, blk_qc_t, from the blk-mq make request functions, that
    allows a later caller to uniquely identify a specific IO. The cookie
    doesn't mean anything to the caller, but the caller can use it to later
    pass back to the block layer. The block layer can then identify the
    hardware queue and request from that cookie.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
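
    A sketch of one plausible cookie encoding, loosely modeled on the helpers
    this series adds (exact names and bit widths are assumptions): the hardware
    queue index sits in the high bits and the per-queue tag in the low bits, so
    the block layer can later find the right hardware queue and request to poll on.

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int blk_qc_t;

    #define BLK_QC_T_NONE   (~0U)   /* "no cookie", e.g. the request was not queued */
    #define BLK_QC_T_SHIFT  16      /* illustrative split between queue and tag */

    static blk_qc_t blk_tag_to_qc_t(unsigned int tag, unsigned int queue_num)
    {
            return tag | (queue_num << BLK_QC_T_SHIFT);
    }

    static bool blk_qc_t_valid(blk_qc_t cookie)
    {
            return cookie != BLK_QC_T_NONE;
    }

    static unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
    {
            return cookie >> BLK_QC_T_SHIFT;
    }

    static unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
    {
            return cookie & ((1U << BLK_QC_T_SHIFT) - 1);
    }

    int main(void)
    {
            blk_qc_t c = blk_tag_to_qc_t(42, 3);
            printf("valid=%d queue=%u tag=%u\n", blk_qc_t_valid(c),
                   blk_qc_t_to_queue_num(c), blk_qc_t_to_tag(c));
            return 0;
    }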
     
  • No functional changes in this patch, but it prepares us for returning
    a more useful cookie related to the IO that was queued up.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     

07 Nov, 2015

3 commits

  • setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
    the current pid namespace. This is in contrast to both the other modes of
    setpriority and the example of kill(-1). Fix this. getpriority and
    ioprio have the same failure mode, fix them too.

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. We have to be careful with
    : changes like this one because someone might be depending on the current
    : behavior. I don't think they are, and I do think this makes sense as part
    : of the pid namespace.

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
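
    For reference, a minimal userspace example of the interface in question.
    With this fix, the PRIO_USER walk only touches tasks visible in the
    caller's pid namespace, matching PRIO_PROCESS/PRIO_PGRP and the kill(-1)
    convention.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
            /* who == 0 means "the calling process's real user ID". */
            if (setpriority(PRIO_USER, 0, 10) < 0)
                    perror("setpriority(PRIO_USER)");

            errno = 0;
            int prio = getpriority(PRIO_USER, 0);
            if (prio == -1 && errno)
                    perror("getpriority(PRIO_USER)");
            else
                    printf("highest priority (lowest nice) among this uid's tasks: %d\n",
                           prio);
            return 0;
    }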
     
    Clearing __GFP_WAIT was used to signal that the caller was in atomic
    context and could not sleep. Now it is possible to distinguish between
    true atomic context and callers that are merely unwilling to sleep; the
    latter should clear only __GFP_DIRECT_RECLAIM so kswapd will still be
    woken. As clearing __GFP_WAIT behaves differently, there is a risk that
    people will clear the wrong flags. This patch renames __GFP_WAIT to
    __GFP_RECLAIM to clearly indicate what it does -- setting it allows all
    reclaim activity, clearing it prevents it.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
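
    A minimal sketch of the resulting flag relationship; the bit values below
    are illustrative only, not the real gfp.h encoding.

    /* Illustrative bit values only -- not the real gfp.h layout. */
    #define __GFP_DIRECT_RECLAIM  0x01u  /* caller may enter direct reclaim (may sleep) */
    #define __GFP_KSWAPD_RECLAIM  0x02u  /* caller wants kswapd woken for background reclaim */

    /* What used to be __GFP_WAIT is now __GFP_RECLAIM, i.e. "all reclaim":
     * setting it allows both behaviours; clearing only __GFP_DIRECT_RECLAIM
     * still leaves the kswapd wakeup in place. */
    #define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)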
     
  • …d avoiding waking kswapd

    The absence of __GFP_WAIT has been used to identify atomic context in
    callers that hold spinlocks or are in interrupts. They are expected to be
    high priority and have access to one of two watermarks lower than "min",
    which can be referred to as the "atomic reserve". __GFP_HIGH users get
    access to the first lower watermark and can be called the "high priority
    reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that check whether they are non-blocking should use the helper
    gfpflags_allow_blocking() where possible, because checking for __GFP_WAIT
    as was done historically can now trigger false positives. Some exceptions
    exist, such as dm-crypt.c, where the code intent is clearer if
    __GFP_DIRECT_RECLAIM is used instead of the helper due to flag
    manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
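
    The blocking check recommended above can be modeled in a few lines; this
    userspace sketch mirrors the gfpflags_allow_blocking() helper using the
    illustrative bit values from the previous sketch (the real GFP_KERNEL
    carries additional bits).

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int gfp_t;

    /* Illustrative values, as in the previous sketch. */
    #define __GFP_DIRECT_RECLAIM  0x01u
    #define __GFP_KSWAPD_RECLAIM  0x02u
    #define GFP_NOWAIT            (__GFP_KSWAPD_RECLAIM)
    #define GFP_KERNEL            (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

    /* "May this allocation sleep?" is now a question about
     * __GFP_DIRECT_RECLAIM rather than the old __GFP_WAIT bit. */
    static bool gfpflags_allow_blocking(gfp_t gfp_flags)
    {
            return gfp_flags & __GFP_DIRECT_RECLAIM;
    }

    int main(void)
    {
            printf("GFP_KERNEL may block: %d\n", gfpflags_allow_blocking(GFP_KERNEL));
            printf("GFP_NOWAIT may block: %d\n", gfpflags_allow_blocking(GFP_NOWAIT));
            return 0;
    }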
     

06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

05 Nov, 2015

3 commits

  • Pull block reservation support from Jens Axboe:
    "This adds support for persistent reservations, both at the core level,
    as well as for sd and NVMe"

    [ Background from the docs: "Persistent Reservations allow restricting
    access to block devices to specific initiators in a shared storage
    setup. All implementations are expected to ensure the reservations
    survive a power loss and cover all connections in a multi path
    environment" ]

    * 'for-4.4/reservations' of git://git.kernel.dk/linux-block:
    NVMe: Precedence error in nvme_pr_clear()
    nvme: add missing endianess annotations in nvme_pr_command
    NVMe: Add persistent reservation ops
    sd: implement the Persistent Reservation API
    block: add an API for Persistent Reservations
    block: cleanup blkdev_ioctl

    Linus Torvalds
     
  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This greatly cuts down
    boot times on systems with thousands of devices using scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

03 Nov, 2015

1 commit

  • Zhangqing Luo reported long boot times on a system with thousands of
    LUNs when scsi-mq was enabled. He narrowed the problem down to
    blk_mq_add_queue_tag_set, where every queue is frozen in order to set
    the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
    added before it in sequence, which involves waiting for an RCU grace
    period for each one. We don't need to do this. After the second queue
    is added, only new queues need to be initialized with the shared tag.
    We can do that by percolating the flag up to the blk_mq_tag_set, and
    updating the newly added queue's hctxs if the flag is set.

    This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
    support for shared tags maps).

    Reported-and-tested-by: Jason Luo
    Reviewed-by: Ming Lei
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

28 Oct, 2015

1 commit

    In commit b49a087 ("block: remove split code in
    blkdev_issue_{discard,write_same}"), discard_granularity and alignment
    checks were removed. Ideally, with late bio splitting, the upper layers
    shouldn't need to depend on the device's limits.

    Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe
    device when running mkfs.xfs. We have not found the root cause yet.

    This patch re-adds discard_granularity and alignment checks by reverting
    the related changes in commit b49a087. The good news is that we can now
    remove the 2G discard size cap and just use UINT_MAX to avoid bi_size
    overflow.

    Reviewed-by: Christoph Hellwig
    Tested-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lin
     

22 Oct, 2015

19 commits

    The stat files on the root cgroup show stats for the whole system and
    usually don't contain any information that isn't available through
    the usual system monitoring mechanisms. Some controllers skip
    collecting these duplicate stats to optimize cases where cgroup isn't
    used and later try to emulate the result on demand.

    This leads to complexities and subtle differences in the information
    shown through different channels. This is entirely unnecessary and
    cgroup v2 is dropping stat files which are duplicate from all
    controllers. This patch removes "io.stat" from the root hierarchy.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Vivek Goyal

    Tejun Heo
     
    Most of the time, flushing the plug list is the hottest I/O path, so
    mark the ctx as pending only after all requests in the list have been
    inserted.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    The trace point traces plug events per request queue rather than per
    task, so we should check the request count in the plug list for the
    current queue instead of the current task.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Now that bio splitting has been introduced, a bio can be split and
    marked as NOMERGE because it is too large to be merged, so check
    bio_mergeable() early to avoid trying to merge it unnecessarily.

    Signed-off-by: Ming Lei
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Ming Lei
     
    It isn't necessary to try to merge a bio that is marked as NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    A split bio is already too large to merge, so mark it as NOMERGE.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    bio->bi_phys_segments is always computed during bio splitting, so it is
    natural to set it up right after splitting; then we can avoid
    recomputing the segment count during merging.

    Reviewed-by: Jeff Moyer
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Request queues with merging disabled will not flush the plug list after
    BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
    on blk_attempt_plug_merge to compute the request_count. Fix this by
    computing the number of queued requests even for nomerge queues.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
    This commit adds a driver API and ioctls for generically controlling
    Persistent Reservations at the block layer. Persistent Reservations are
    supported by SCSI and NVMe and allow controlling who gets access to a
    device in a shared storage setup.

    Note that we add a pr_ops structure to struct block_device_operations
    instead of adding the members directly to avoid bloating all instances
    of devices that will never support Persistent Reservations.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
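
    A short userspace sketch of how these ioctls are meant to be driven,
    assuming the uapi header <linux/pr.h> and the IOC_PR_* commands this
    series introduces; the device path and key are hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/pr.h>

    int main(int argc, char **argv)
    {
            const char *dev = argc > 1 ? argv[1] : "/dev/sdb"; /* hypothetical */
            int fd = open(dev, O_RDWR);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* Register our reservation key with the device... */
            struct pr_registration reg = { .old_key = 0, .new_key = 0x1234abcd };
            if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0)
                    perror("IOC_PR_REGISTER");

            /* ...then take a write-exclusive reservation using that key. */
            struct pr_reservation rsv = { .key = 0x1234abcd, .type = PR_WRITE_EXCLUSIVE };
            if (ioctl(fd, IOC_PR_RESERVE, &rsv) < 0)
                    perror("IOC_PR_RESERVE");

            close(fd);
            return 0;
    }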
     
  • Split out helpers for all non-trivial ioctls to make this function simpler,
    and also start passing around a pointer version of the argument, as that's
    what most ioctl handlers actually need.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    The libnvdimm-btt and nvme drivers use blk_integrity to reserve space
    for per-sector metadata, but sometimes without protection checksums.
    This property is generically useful, so teach the block core to
    internally specify a nop profile if one is not provided at registration
    time.

    Cc: Keith Busch
    Cc: Matthew Wilcox
    Suggested-by: Christoph Hellwig
    [hch: kill the local nvme nop profile as well]
    Acked-by: Martin K. Petersen
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Since they lack requests to pin the request_queue active, synchronous
    bio-based drivers may have in-flight integrity work from
    bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush
    that work to prevent races between freeing the queue and the final use
    of the blk_integrity profile.

    This is temporary unless/until bio-based drivers start to generically
    take a q_usage_counter reference while a bio is in-flight.

    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
    A trace like the following precedes a crash in bio_integrity_process()
    when it goes to use an already-freed blk_integrity profile.

    BUG: unable to handle kernel paging request at ffff8800d31b10d8
    IP: [] 0xffff8800d31b10d8
    PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3
    Oops: 0011 [#1] SMP
    Dumping ftrace buffer:
    ---------------------------------
    ndctl-2222 2.... 44526245us : disk_release: pmem1s
    systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s
    -409 4.... 44574005us : bio_integrity_process: pmem1s
    ---------------------------------
    [..]
    Call Trace:
    [] ? bio_integrity_process+0x159/0x2d0
    [] bio_integrity_verify_fn+0x36/0x60
    [] process_one_work+0x1cc/0x4e0

    Given that a request_queue is pinned while i/o is in flight and that a
    gendisk is allowed to have a shorter lifetime, move blk_integrity to
    request_queue to satisfy requests arriving after the gendisk has been
    torn down.

    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
    Allow pmem, and other synchronous/bio-based block drivers, to fall back
    on a per-cpu reference count managed by the core for tracking queue
    live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
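
    The lifetime rule described above can be modeled in userspace with a plain
    atomic standing in for the per-cpu reference; everything below is an
    illustrative model, not the kernel implementation.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative model: a plain atomic stands in for the per-cpu ref. */
    struct queue {
            atomic_int  refs;     /* in-flight submissions */
            atomic_bool dying;    /* set once teardown starts */
    };

    static bool queue_enter(struct queue *q)
    {
            if (atomic_load(&q->dying))
                    return false;                 /* queue already marked dead */
            atomic_fetch_add(&q->refs, 1);
            if (atomic_load(&q->dying)) {         /* re-check to close the race */
                    atomic_fetch_sub(&q->refs, 1);
                    return false;
            }
            return true;
    }

    static void queue_exit(struct queue *q)
    {
            atomic_fetch_sub(&q->refs, 1);
    }

    static void submit(struct queue *q)
    {
            if (!queue_enter(q)) {
                    fprintf(stderr, "queue is dead, failing the I/O\n");
                    return;
            }
            /* ->make_request_fn() would run here, with the queue pinned live. */
            queue_exit(q);
    }

    static void cleanup(struct queue *q)
    {
            atomic_store(&q->dying, true);        /* analogous to killing the ref */
            while (atomic_load(&q->refs) > 0)
                    ;                             /* wait for in-flight I/O to drain */
    }

    int main(void)
    {
            struct queue q = {0};
            submit(&q);        /* succeeds */
            cleanup(&q);
            submit(&q);        /* now rejected: the queue is dead */
            return 0;
    }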
     
    Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM goes through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The size of the data interval was not exported in the sysfs integrity
    directory. Export it so that userland apps can tell whether the interval
    is different from the device's logical block size.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The per-device properties in the blk_integrity structure were previously
    unsigned short. However, most of the values fit inside a char. The only
    exception is the data interval size and we can work around that by
    storing it as a power of two.

    This cuts the size of the dynamic portion of blk_integrity in half.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • We previously made a complete copy of a device's data integrity profile
    even though several of the fields inside the blk_integrity struct are
    pointers to fixed template entries in t10-pi.c.

    Split the static and per-device portions so that we can reference the
    template directly.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The integrity kobject purely exists to support the integrity
    subdirectory in sysfs and doesn't really have anything to do with the
    blk_integrity data structure. Move the kobject to struct gendisk where
    it belongs.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

15 Oct, 2015

2 commits

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue(), which makes
    the queue invisible and starts draining the remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back.

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
    tags is freed in blk_mq_free_rq_map() and should not be used after that.
    The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
    free_cpumask_var() is a no-op.

    tags->cpumask is allocated in blk_mq_init_tags(), so it's natural to
    free the cpumask in its counterpart, blk_mq_free_tags().

    Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
    Signed-off-by: Jun'ichi Nomura
    Cc: Keith Busch
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

10 Oct, 2015

3 commits

  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_tag_update_depth() seems to be missing a memory barrier which
    might cause the waker to not notice the waiter and fail to send a
    wake_up as in the following figure.

    blk_mq_tag_update_depth                   bt_get
    ------------------------------------------------------------------------
    if (waitqueue_active(&bs->wait))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                              prepare_to_wait(&bs->wait, &wait,
                                                TASK_UNINTERRUPTIBLE);
                                              tag = __bt_get(hctx, bt, last_tag,
                                                tags);
                                              /* Value set in bt_update_count
                                                 not visible yet */
    bt_update_count(&tags->bitmap_tags, tdepth);
    /* blk_mq_tag_wakeup_all(tags, false); */
    bt = &tags->bitmap_tags;
    wake_index = atomic_read(&bt->wake_index);
    ...
                                              io_schedule();
    ------------------------------------------------------------------------

    This patch adds the missing memory barrier.

    I found this issue when I was looking through the linux source code
    for places calling waitqueue_active() before wake_up*(), but without
    preceding memory barriers, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (Details about the original issue can be
    found here: https://lkml.org/lkml/2015/9/28/849).

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Jens Axboe

    Kosuke Tatsukawa
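
    The general fix pattern for this class of bug is shown below as a
    kernel-style fragment; it is not the literal patch, and
    update_shared_state()/condition_is_met() are hypothetical stand-ins for
    the write the waiter must observe and its corresponding check.

    /* waker side */
    update_shared_state();          /* hypothetical: e.g. the new tag depth */
    smp_mb();                       /* pairs with the barrier implied by
                                       prepare_to_wait()/set_current_state()
                                       on the waiter side */
    if (waitqueue_active(&wq))
            wake_up(&wq);

    /* waiter side */
    prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
    if (!condition_is_met())        /* hypothetical check of the shared state */
            io_schedule();
    finish_wait(&wq, &wait);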
     
  • Linux 4.3-rc4

    Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone
    in since for-4.4/core was based.

    Jens Axboe
     

01 Oct, 2015

1 commit