18 Jul, 2018

3 commits

  • c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
    read/write") replaced @op with the boolean @is_write, which limited the
    amount of information going into ->rw_page() and, more importantly,
    page_endio(); on the plus side, it removed the need to expose block
    internals to mm.

    Unfortunately, we want to track discards separately and @is_write
    isn't enough information. This patch updates bdev_ops->rw_page() to
    take REQ_OP instead but leaves page_endio() to take bool @is_write.
    This allows the block part of operations to have enough information
    while not leaking it to mm.
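
    Schematically, the prototype change looks like this (a sketch; argument
    names may differ from the final patch):

      /* before: only the read/write direction reaches the driver */
      int (*rw_page)(struct block_device *bdev, sector_t sector,
                     struct page *page, bool is_write);

      /* after: the full REQ_OP is passed down, so e.g. discards can be
       * tracked; page_endio() keeps its bool and mm stays REQ_OP-free */
      int (*rw_page)(struct block_device *bdev, sector_t sector,
                     struct page *page, unsigned int op);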

    Signed-off-by: Tejun Heo
    Cc: Mike Christie
    Cc: Minchan Kim
    Cc: Dan Williams
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Remove checkpatch errors caused by assignment operations inside if
    conditions.
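
    The offending pattern and its fix look roughly like this (illustrative):

      /* checkpatch: ERROR: do not use assignment in if condition */
      if ((err = do_something()) != 0)
              return err;

      /* preferred: assign first, then test */
      err = do_something();
      if (err)
              return err;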

    Signed-off-by: RAGHU Halharvi
    Signed-off-by: Jens Axboe

    RAGHU Halharvi
     
  • In the case of the 'none' io scheduler, when the hw queue isn't busy,
    it isn't necessary to enqueue a request to the sw queue and dequeue it
    again, because the request can be submitted to the hw queue directly
    at no extra cost. Meanwhile there shouldn't be many requests in the sw
    queue, so we don't need to worry about the effect on IO merging.

    There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...)
    which may be connected to high performance devices, so 'none' is often
    required for obtaining good performance.

    This patch improves IOPS and decreases CPU utilization on megaraid_sas,
    per Kashyap's test.
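
    Conceptually, the fast path amounts to something like the following
    sketch (the busy check and helper names here are assumptions, not the
    exact functions in the patch):

      /* no io scheduler and the hw queue is idle: issue directly,
       * skipping the sw queue round trip */
      if (!q->elevator && !blk_mq_hctx_busy(hctx))      /* assumed check */
              blk_mq_try_issue_directly(hctx, rq);      /* assumed helper */
      else
              blk_mq_insert_request(hctx, rq);          /* sw-queue path */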

    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Reported-by: Kashyap Desai
    Tested-by: Kashyap Desai
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

17 Jul, 2018

2 commits

  • In our longer tests we noticed that some boxes would degrade to the
    point of uselessness. This is because we truncate the current time when
    saving it in our bio, but I was using the raw current time to subtract
    from. So once the box had been up a certain amount of time it would
    appear as if our IOs were taking several years to complete. Fix this
    by truncating the current time so it matches the issue time. Verified
    this worked by running with this patch for a week on our test tier.
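
    The mismatch and the fix, in miniature (using the bio_issue-style
    truncation helper; a sketch, not the exact diff):

      /* the stamp stored in the bio is truncated when saved */
      u64 issue = __bio_issue_time(ktime_to_ns(ktime_get()));

      /* bug: subtracting from the raw clock makes the apparent latency
       * grow with uptime once the truncated bits differ */
      u64 bad_lat = ktime_to_ns(ktime_get()) - issue;

      /* fix: truncate "now" the same way before subtracting */
      u64 now = __bio_issue_time(ktime_to_ns(ktime_get()));
      u64 lat = now - issue;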

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Early versions of these patches had us waiting for seconds at a time
    during submission, so we had to adjust the timing window we monitored
    for latency. Now we don't do things like that so this is unnecessary
    code.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

13 Jul, 2018

12 commits

  • We can't know if a block is closed or not on 1.2 devices, so assume
    closed state to make sure that blocks are erased before writing.

    Fixes: 32ef9412c114 ("lightnvm: pblk: implement get log report chunk")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • In the read path, partial reads are currently performed synchronously
    which affects performance for workloads that generate many partial
    reads. This patch adds an asynchronous partial read path as well as
    the required partial read ctx.

    Signed-off-by: Heiner Litz
    Reviewed-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Heiner Litz
     
  • In preparation to enabling -Wimplicit-fallthrough, mark switch cases
    where we are expecting to fall through.
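
    At the time this meant annotating the intentional cases with a comment
    that gcc's -Wimplicit-fallthrough recognizes, e.g.:

      switch (state) {
      case STATE_PREP:
              prepare();
              /* fall through */
      case STATE_RUN:
              run();
              break;
      }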

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Gustavo A. R. Silva
     
  • The error messages in pblk do not say which pblk instance a message
    occurred from. Update each error message to reflect the instance it
    belongs to, and also prefix it with pblk, so we know the message comes
    from the pblk module.
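
    A sketch of the kind of prefixed print helper this implies (the macro
    name and fields are assumptions):

      #define pblk_err(pblk, fmt, ...) \
              pr_err("pblk %s: " fmt, (pblk)->disk->disk_name, ##__VA_ARGS__)

      pblk_err(pblk, "line %d recovery failed\n", line->id);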

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • For devices that do not specify a limit on their transfer size, the
    get_chk_meta command may send down a single I/O retrieving the full
    chunk metadata table, resulting in large 2-4MB I/O requests. Instead,
    split up the I/Os to a maximum of 256KB and issue them separately to
    reduce memory requirements.
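
    In outline, the splitting loop looks something like this (the 256KB cap
    is from the text; the issuing helper is an assumption):

      size_t max_len = 256 * 1024;     /* per-I/O cap */
      size_t off, len;
      int ret = 0;

      for (off = 0; off < total_len && !ret; off += len) {
              len = min(max_len, total_len - off);
              /* issue one bounded get_chk_meta I/O; assumed helper */
              ret = issue_chk_meta_io(dev, buf + off, off, len);
      }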

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • If using pblk on a 32-bit architecture, and there is a need to
    perform a partial read, the partial read bitmap will only have
    allocated 32 entries, whereas 64 are needed.

    Make sure that the read_bitmap is initialized to 64 bits on 32-bit
    architectures as well.
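
    The difference, in miniature (with PBLK_MAX_REQ_ADDRS standing in for
    the 64-entry maximum):

      /* broken on 32-bit: unsigned long is only 32 bits wide */
      unsigned long read_bitmap;

      /* fixed: always reserves enough longs for 64 entries */
      DECLARE_BITMAP(read_bitmap, PBLK_MAX_REQ_ADDRS);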

    Signed-off-by: Matias Bjørling
    Reviewed-by: Igor Konopko
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Since both blk_old_get_request() and blk_mq_alloc_request() initialize
    rq->__data_len to zero, it is not necessary to initialize that member
    in nvme_nvm_alloc_request(). Hence remove the rq->__data_len
    initialization from nvme_nvm_alloc_request().

    Signed-off-by: Bart Van Assche
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • When recovering a line, an extra check was added when debugging was
    active, such that the minor version was also checked. Unfortunately,
    this used ifdef NVM_DEBUG, which is not correct.

    Instead use the proper DEBUG define, and now that it compiles, also
    fix the variable.

    Signed-off-by: Matias Bjørling
    Fixes: d0ab0b1ab991f ("lightnvm: pblk: check data lines version on recovery")
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • There are no users of CONFIG_NVM_DEBUG in the LightNVM subsystem. All
    users are in pblk. Rename NVM_DEBUG to NVM_PBLK_DEBUG and enable it
    only for pblk.

    Also fix up the CONFIG_NVM_PBLK entry to follow the code style for
    Kconfig files.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Some devices can expose an mw_cunits value equal to 0, which can cause
    the creation of a too-small write buffer and make performance drop on
    write workloads.

    Additionally, the write buffer size must cover the write data
    requirements, such as WS_MIN and MW_CUNITS - it must be greater than
    or equal to the larger of the two multiplied by the number of PUs.
    However, for performance reasons, use the WS_OPT value in the
    calculation instead of WS_MIN.

    Because the place where the buffer size is calculated was changed,
    this patch also removes the pgs_in_buffer field from the pblk
    structure.
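
    The sizing rule reduces to roughly this (field and variable names
    hedged):

      /* cover the larger of ws_opt (used instead of ws_min for
       * performance) and mw_cunits, across all parallel units */
      buffer_size = max(geo->ws_opt, geo->mw_cunits) * total_pus;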

    Signed-off-by: Marcin Dziegielewski
    Signed-off-by: Igor Konopko
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Marcin Dziegielewski
     
  • Remove the blkdev_entry_to_request() macro, which has remained unused
    throughout its observable history; note also that it repeats the
    list_entry_rq() macro verbatim.

    Signed-off-by: Vladimir Zapolskiy
    Signed-off-by: Jens Axboe

    Vladimir Zapolskiy
     
  • Use the existing %pad printk format to print dma_addr_t values.
    This avoids the following warnings when compiling on the parisc64 platform:

    drivers/block/skd_main.c: In function 'skd_preop_sg_list':
    drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'dma_addr_t {aka unsigned int}' [-Wformat=]
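
    For reference, %pad prints a dma_addr_t portably and takes a pointer to
    the value:

      dma_addr_t dma_addr;

      pr_debug("pa 0x%llx\n", dma_addr);   /* warns where dma_addr_t is 32-bit */
      pr_debug("pa %pad\n", &dma_addr);    /* correct on all platforms */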

    Reviewed-by: Bart Van Assche
    Signed-off-by: Helge Deller
    Signed-off-by: Jens Axboe

    Helge Deller
     

12 Jul, 2018

1 commit

  • The code poses a security risk due to user memory access in ->release
    and had an API that can't be used reliably. As far as we know it was
    never used for real, but if that turns out wrong we'll have to revert
    this commit and come up with a band aid.

    Jann Horn did look through software archives for users of this
    interface, and the only users found were example code in sg3_utils, and
    optional support in an optional module of the tgt user space iscsi
    target, which looks like a proof of concept extension of the /dev/sg
    read/write support.

    Tony Battersby chimes in that the code is basically unsafe to use in
    general:

    The read/write interface on /dev/bsg is impossible to use safely
    because the list of completed commands is per-device (bd->done_list)
    rather than per-fd like it is with /dev/sg. So if program A and
    program B are both using the write/read interface on the same bsg
    device, then their command responses will get mixed up, and program
    A will read() some command results from program B and vice versa.
    So no, I don't use read/write on /dev/bsg. From a security standpoint,
    it should definitely be fixed or removed.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jul, 2018

2 commits

  • max_depth used to be a u64, but I changed it to an unsigned int and
    didn't convert my comparisons over everywhere. Fix by using UINT_MAX
    everywhere instead of (u64)-1.
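
    A minimal illustration of why the old comparison can never fire:

      unsigned int max_depth = -1;   /* stores UINT_MAX (0xffffffff) */

      if (max_depth == (u64)-1)      /* never true: promotes to 0xffffffff,
                                        not 0xffffffffffffffff */
              do_something();

      if (max_depth == UINT_MAX)     /* the intended test */
              do_something();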

    Reported-by: Dan Carpenter
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • On 32-bit architectures, dividing a 64-bit number needs to use the
    do_div() function or something like it to avoid a link failure:

    block/blk-iolatency.o: In function `iolatency_prfill_limit':
    blk-iolatency.c:(.text+0x8cc): undefined reference to `__aeabi_uldivmod'

    Using div_u64() gives us the best output and avoids the need for an
    explicit cast.
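
    div_u64() does a 64-by-32 division without pulling in the libgcc
    helper, e.g. (the value here is hypothetical):

      u64 total_ns = stat->mean;                    /* hypothetical input */
      u64 usecs = div_u64(total_ns, NSEC_PER_USEC); /* links on 32-bit too */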

    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Reviewed-by: Josef Bacik
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

09 Jul, 2018

20 commits

  • Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled
    by marking the unused functions as __maybe_unused.

    ../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
    ../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
    ../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]
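
    The annotation keeps the definitions compiled without warnings when
    CONFIG_PROC_FS leaves them unreferenced, e.g.:

      /* may be unreferenced; no -Wunused-function either way */
      static int __maybe_unused dac960_proc_show(struct seq_file *m, void *v)
      {
              ...
      }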

    Signed-off-by: Randy Dunlap
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Signed-off-by: Jens Axboe

    Randy Dunlap
     
  • Adds support for exposing a null_blk device through the zone device
    interface.

    The interface is managed with the parameters zoned and zone_size.
    If zoned is set, the null_blk instance registers as a zoned block
    device. The zone_size parameter defines how big each zone will be.
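
    A sketch of how such parameters are typically declared (the variable
    names and defaults below are assumptions):

      static bool g_zoned;
      module_param_named(zoned, g_zoned, bool, 0444);
      MODULE_PARM_DESC(zoned, "Register as a zoned block device");

      static unsigned long g_zone_size = 256;
      module_param_named(zone_size, g_zone_size, ulong, 0444);
      MODULE_PARM_DESC(zone_size, "Zone size in MB, power of two");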

    Signed-off-by: Matias Bjørling
    Signed-off-by: Bart Van Assche
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Split the null_blk device driver in preparation for zoned block
    interface support.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
    switch (rq->q->rpm_status) {
    ^

    Convert the return statement below the switch() block into a default
    case to fix this.
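
    Roughly, the trailing return moves into the switch (a sketch of the
    shape, not the verbatim diff):

      /* before: RPM_ACTIVE is unhandled, so gcc warns */
      switch (rq->q->rpm_status) {
      case RPM_SUSPENDED:
              return false;
      case RPM_SUSPENDING:
      case RPM_RESUMING:
              return rq->rq_flags & RQF_PM;
      }
      return true;

      /* after: the fall-out return becomes the default case */
      switch (rq->q->rpm_status) {
      case RPM_SUSPENDED:
              return false;
      case RPM_SUSPENDING:
      case RPM_RESUMING:
              return rq->rq_flags & RQF_PM;
      default:
              return true;
      }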

    Fixes: e4f36b249b4d4e75 ("block: fix peeking requests during PM")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     
  • If __blkdev_issue_discard is in progress and a device mapper device is
    reloaded with a table that doesn't support discard,
    q->limits.max_discard_sectors is set to zero. This results in an
    infinite loop in __blkdev_issue_discard.

    This patch checks if max_discard_sectors is zero and aborts with
    -EOPNOTSUPP.
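
    The guard amounts to a few lines near the top of the discard loop (a
    sketch of the shape, not the verbatim patch):

      /* bail out instead of looping forever when a dm table reload
       * dropped discard support mid-operation */
      if (!q->limits.max_discard_sectors)
              return -EOPNOTSUPP;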

    Signed-off-by: Mikulas Patocka
    Tested-by: Zdenek Kabelac
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
    OR in __GFP_HIGH explicitly, so just remove the unnecessary __GFP_HIGH.
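
    That is, allocations of the two forms below are equivalent:

      ptr = kmalloc(size, GFP_ATOMIC | __GFP_HIGH);  /* redundant flag */
      ptr = kmalloc(size, GFP_ATOMIC);               /* same behavior */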

    Signed-off-by: Shakeel Butt
    Signed-off-by: Jens Axboe

    Shakeel Butt
     
  • Currently the mbps knob can only be set once, before the power knob is
    first switched on; after the power knob has been set at least once,
    there is no way to set the mbps knob again due to -EBUSY.

    As nullb is mainly used for testing, in order to make it flexible, this
    removes the flag NULLB_DEV_FL_CONFIGURED so that mbps knob can be reset
    when power knob is off, e.g.

    echo 0 > /config/nullb/a/power
    echo 40 > /config/nullb/a/mbps
    echo 1 > /config/nullb/a/power

    The same applies to the other knobs under /config/nullb/a.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     
  • We noticed in testing we'd get pretty bad latency stalls under heavy
    pressure because read ahead would try to do its thing while the cgroup
    was under severe pressure. If we're under this much pressure we want to
    do as little IO as possible so that we can still make progress on real
    work if we're a throttled cgroup; hence, just skip readahead if our
    group is under pressure.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Add basic documentation describing the interface, statistics, and
    behavior of io.latency.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in addition to
    the actual io.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes can
    be particularly fast, which could give us a false sense of the impact of
    other workloads on our protected workload.
    - By default there's no throttling, we set the queue_depth to INT_MAX so that
    we can have as many outstanding bios as we're allowed to. Only at
    throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
    delays in order to deal with cases like metadata only or swap heavy
    workloads.

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 IO at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • wbt cares only about request completion time, but controllers may need
    information that is on the bio itself, so add a done_bio callback for
    rq-qos so that things like blk-iolatency can look at the bio when it
    completes.
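
    In terms of the rq-qos hooks, this adds a bio-completion callback next
    to the request-completion one (a sketch of the shape):

      struct rq_qos_ops {
              void (*done)(struct rq_qos *, struct request *); /* wbt */
              void (*done_bio)(struct rq_qos *, struct bio *); /* iolatency */
              /* ... other hooks ... */
      };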

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We don't really need to save this stuff in the core block code, we can
    just pass the bio back into the helpers later on to derive the same
    flags and update the rq->wbt_flags appropriately.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkcg-qos is going to do essentially what wbt does, only on a cgroup
    basis. Break out the common code that will be shared between blkcg-qos
    and wbt into blk-rq-qos.* so they can both utilize the same
    infrastructure.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We need to use blk_rq_stat in the blkcg qos stuff, so export some of
    these helpers so they can be used by other things.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Since IO can be issued from literally anywhere it's almost impossible to
    do throttling without having some sort of adverse effect somewhere else
    in the system because of locking or other dependencies. The best way to
    solve this is to do the throttling when we know we aren't holding any
    other kernel resources. Do this by tracking throttling on a per-blkg
    basis, and if we require throttling, flag the task so that it checks
    before it returns to user space and possibly sleeps there.

    This is to address the case where a process is doing work that is
    generating IO that can't be throttled, whether that is directly with a
    lot of REQ_META IO, or indirectly by allocating so much memory that it
    is swamping the disk with REQ_SWAP. We can't use task_add_work as we
    don't want to induce a memory allocation in the IO path, so simply
    saving the request queue in the task and flagging it to do the
    notify_resume thing achieves the same result without the overhead of a
    memory allocation.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • For backcharging we need to know who the page belongs to when swapping
    it out. We don't worry about things that do ->rw_page (zram etc) at the
    moment, we're only worried about pages that actually go to a block
    device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Just like REQ_META, it's important to know that the IO coming down is swap
    in order to guard against potential IO priority inversion issues with
    cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
    bio_issue_as_root_blkg helper.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blk-iolatency has a few stats that it would like to print out, and
    instead of adding a bunch of crap to the generic code just provide a
    helper so that controllers can add stuff to the stat line if they want
    to.

    Hide it behind a boot option since it changes the output of io.stat from
    normal, and these stats are only interesting to developers.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Instead of forcing all file systems to get the right context on their
    bio's, simply check for REQ_META to see if we need to issue as the root
    blkg. We don't want to force all bios to have the root blkg associated
    with them if REQ_META is set, as some controllers (blk-iolatency) need
    to know which cgroup the IO originated from so they can backcharge it
    for the work it is doing. This helper makes sure that the controllers
    do the proper thing wrt the IO priority and backcharging.
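
    The helper reduces to a flag test, roughly (REQ_SWAP per the related
    patch in this series):

      /* issue as the root blkg for metadata/swap IO, while the bio keeps
       * its originating cgroup so controllers can backcharge it */
      static inline bool bio_issue_as_root_blkg(struct bio *bio)
      {
              return (bio->bi_opf & (REQ_META | REQ_SWAP)) != 0;
      }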

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik