08 Nov, 2018

1 commit

  • syzbot is reporting a NULL pointer dereference [1] caused by a race
    between ioctl(loop_fd, LOOP_CLR_FD, 0) and
    ioctl(other_loop_fd, LOOP_SET_FD, loop_fd): loop_validate_file()
    traverses other loop devices without holding their corresponding
    lo->lo_ctl_mutex locks.

    Since ioctl() requests on loop devices are not frequent operations,
    we don't need fine-grained locking. Let's use a global lock so that
    the traversal in loop_validate_file() is safe (sketched below).

    Note that syzbot is also reporting a circular locking dependency
    between bdev->bd_mutex and lo->lo_ctl_mutex [2], caused by calling
    blkdev_reread_part() with the lock held. This patch does not
    address that.

    [1] https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
    [2] https://syzkaller.appspot.com/bug?id=bf154052f0eea4bc7712499e4569505907d15889

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Reviewed-by: Jan Kara
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tetsuo Handa
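
    A minimal sketch of the global-lock idea (illustrative, not the
    literal patch; one mutex serializes every loop ioctl so the
    cross-device traversal cannot race; lo_do_ioctl() is a hypothetical
    helper):

      #include <linux/mutex.h>

      /* global: replaces per-device locking on the ioctl paths */
      static DEFINE_MUTEX(loop_ctl_mutex);

      static int lo_ioctl(struct block_device *bdev, fmode_t mode,
                          unsigned int cmd, unsigned long arg)
      {
              int err;

              mutex_lock(&loop_ctl_mutex);
              /* loop_validate_file() can now walk other loop devices'
               * lo->lo_backing_file safely: LOOP_SET_FD/LOOP_CLR_FD on
               * any device takes the same lock. */
              err = lo_do_ioctl(bdev, cmd, arg);  /* hypothetical helper */
              mutex_unlock(&loop_ctl_mutex);
              return err;
      }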
     

08 May, 2018

1 commit

  • syzbot is hitting a WARN() triggered by memory allocation fault
    injection [1]: the loop module calls sysfs_remove_group() even when
    sysfs_create_group() failed. Fix this by remembering whether
    sysfs_create_group() succeeded (sketched below).

    [1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Reviewed-by: Greg Kroah-Hartman

    Renamed sysfs_ready -> sysfs_inited.

    Signed-off-by: Jens Axboe

    Tetsuo Handa
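
    A sketch of the fix (the flag name follows the rename note above;
    surrounding code abbreviated):

      struct loop_device {
              /* ... */
              bool sysfs_inited;  /* set only if sysfs_create_group() succeeded */
      };

      static void loop_sysfs_init(struct loop_device *lo)
      {
              lo->sysfs_inited = !sysfs_create_group(
                              &disk_to_dev(lo->lo_disk)->kobj,
                              &loop_attribute_group);
      }

      static void loop_sysfs_exit(struct loop_device *lo)
      {
              if (lo->sysfs_inited)
                      sysfs_remove_group(&disk_to_dev(lo->lo_disk)->kobj,
                                         &loop_attribute_group);
      }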
     

26 Sep, 2017

1 commit

  • The loop block device handles IO in a separate thread. The IO it
    actually dispatches isn't cloned from the IO the loop device
    received, so the dispatched IO loses the cgroup context.

    I'm ignoring the buffered IO case for now, which is quite
    complicated. Making the loop thread aware of the cgroup context
    doesn't really help there: the loop device only writes to a single
    file, and in the current writeback cgroup implementation a file can
    only belong to one cgroup.

    For the direct IO case, we could work around the issue in theory.
    For example, say we assign cgroup1 5M/s of bandwidth for the loop
    device and cgroup2 10M/s. We could create a special cgroup for the
    loop thread and assign at least 15M/s to the underlying disk. That
    would throttle the two cgroups correctly, but it is tricky to set
    up.

    This patch addresses the issue directly. We record the bio's css in
    the loop command. When the loop thread handles the command, we use
    the API introduced in patch 1 to set the css for the current task,
    and the bio layer then uses that css for new IO (from patch 3); see
    the sketch after this entry.

    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
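
    A sketch of the mechanism (kthread_associate_blkcg() is the helper
    introduced by patch 1 of this series; the rest is simplified):

      struct loop_cmd {
              /* ... */
              struct cgroup_subsys_state *css;  /* cgroup of the submitting bio */
      };

      static void loop_handle_cmd(struct loop_cmd *cmd)
      {
              /* dispatch as if we were running in the submitter's cgroup */
              kthread_associate_blkcg(cmd->css);
              /* ... submit direct IO to the backing file ... */
              kthread_associate_blkcg(NULL);  /* restore */
      }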
     

25 Sep, 2017

1 commit

  • When the request is completed, lo_complete_rq() checks cmd->use_aio.
    However, if this is in fact an aio request, cmd->use_aio will have
    already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a
    union (see the sketch below). On x86_64 there's a hole after the
    union anyway, so this doesn't make struct loop_cmd any bigger.

    Fixes: 92d773324b7e ("block/loop: fix use after free")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
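
    The shape of the fix, as described above (other fields omitted):

      struct loop_cmd {
              /* previously: union { bool use_aio; atomic_t ref; }; */
              bool use_aio;  /* still readable when the request completes */
              atomic_t ref;  /* used only by the aio path (lo_rw_aio*) */
              /* ... */
      };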
     

01 Sep, 2017

2 commits

  • Currently loop disables merging. While that makes sense for
    buffered IO mode, direct IO mode can benefit from request merging.
    Without merging, loop can send small IOs to the underlying disk and
    hurt performance; see the sketch below.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
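
    A sketch of toggling merging with the IO mode (the queue flag and
    helpers are standard block-layer APIs of that era; the exact hook
    point in the driver is my assumption):

      if (lo->use_dio)
              queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
      else
              queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);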
     
  • This is only used for setting the soft block size on the struct
    block_device once and then never used again.

    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

08 Jun, 2017

1 commit

  • When generating bootable VM images, certain systems (most notably
    s390x) require devices with a 4k blocksize. This patch implements
    a new flag, 'LO_FLAGS_BLOCKSIZE', which sets the physical
    blocksize to that of the underlying device and allows changing the
    logical blocksize up to the physical blocksize (sketched below).

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
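
    A sketch of the implied queue-limit updates (the helpers are real
    block-layer APIs; backing_bdev and bsize are illustrative names):

      unsigned int pbsize = bdev_physical_block_size(backing_bdev);

      blk_queue_physical_block_size(lo->lo_queue, pbsize);
      /* logical size requested via the ioctl, capped at the physical size */
      if (bsize <= pbsize)
              blk_queue_logical_block_size(lo->lo_queue, bsize);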
     

24 Sep, 2015

3 commits

  • There are at least three advantages to using direct I/O and AIO on
    reads/writes to loop's backing file (a submission sketch follows
    this entry):

    1) the double cache can be avoided, so memory usage decreases
    a lot

    2) unlike user-space direct I/O, there is no cost for
    pinning pages

    3) context switches are avoided while still obtaining good throughput
    - with buffered file reads, top random I/O throughput is often
    obtained only when requests are submitted concurrently from lots of
    tasks; but sequential I/O mostly hits the page cache, so concurrent
    submission often introduces unnecessary context switches without
    improving throughput much. There was a discussion [1] about using
    non-blocking I/O to improve this for applications.
    - with direct I/O and AIO, concurrent submission can be avoided
    while random read throughput is unaffected

    xfstests (-g auto, ext4) basically passes when running with direct
    I/O (aio); the one exception is generic/232, but that fails with
    loop buffered I/O (4.2-rc6-next-20150814) too.

    The fio test results for performance follow:
    4-job fio test inside an ext4 file system over a loop block device

    1) How to run
    - KVM: 4 VCPUs, 2G RAM
    - linux kernel: 4.2-rc6-next-20150814(base) with the patchset
    - the loop block device is backed by one image on an SSD
    - linux psync, 4 jobs, size 1500M, ext4 over the loop block device
    - test result: IOPS from fio output

    2) Throughput(IOPS) becomes a bit better with direct I/O(aio)
    -------------------------------------------------------------
    test cases    | randread | read   | randwrite | write  |
    -------------------------------------------------------------
    base          | 8015     | 113811 | 67442     | 106978 |
    -------------------------------------------------------------
    base+loop aio | 8136     | 125040 | 67811     | 111376 |
    -------------------------------------------------------------

    - this is likely because more page cache is available to the
    application, or because one extra page copy is avoided in the
    direct I/O case

    3) context switches
    - context switches decreased by ~50% with loop direct I/O (aio)
    compared with loop buffered I/O (4.2-rc6-next-20150814)

    4) memory usage from /proc/meminfo
    ---------------------------------------------------------
                              | Buffers | Cached
    ---------------------------------------------------------
    base                      | > 760MB | ~950MB
    ---------------------------------------------------------
    base+loop direct I/O(aio) | < 5MB   | ~1.6GB
    ---------------------------------------------------------

    - so there is much more page cache available for applications with
    direct I/O

    [1] https://lwn.net/Articles/612483/

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
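
    A sketch of driving async direct I/O to the backing file with a
    kiocb (the completion-callback name matches the driver; iterator
    setup and error handling are elided):

      struct iov_iter iter;
      ssize_t ret;
      /* iter is set up over the request's bio_vec pages (omitted) */

      cmd->iocb.ki_pos = pos;
      cmd->iocb.ki_filp = file;
      cmd->iocb.ki_flags = IOCB_DIRECT;
      cmd->iocb.ki_complete = lo_rw_aio_complete;  /* async completion */

      ret = file->f_op->read_iter(&cmd->iocb, &iter);  /* or ->write_iter() */
      if (ret != -EIOCBQUEUED)
              cmd->iocb.ki_complete(&cmd->iocb, ret, 0);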
     
  • This patch provides one interface for enabling direct IO
    from user space:

    - userspace (such as losetup) can pass a 'file' that was
    opened/fcntl'd with O_DIRECT (see the usage sketch below)

    Also, __loop_update_dio() is introduced to check whether direct I/O
    can be used with the current loop settings.

    The last big change is the introduction of the LO_FLAGS_DIRECT_IO
    flag, which lets userspace know whether direct IO is used to access
    the backing file.

    Cc: linux-api@vger.kernel.org
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
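
    A userspace usage sketch (the ioctls, struct and flag are the UAPI
    this message describes; error handling omitted):

      #define _GNU_SOURCE          /* for O_DIRECT */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <linux/loop.h>

      int main(void)
      {
              int backing = open("image.img", O_RDWR | O_DIRECT);
              int loopfd  = open("/dev/loop0", O_RDWR);
              struct loop_info64 info;

              ioctl(loopfd, LOOP_SET_FD, backing);
              ioctl(loopfd, LOOP_GET_STATUS64, &info);
              if (info.lo_flags & LO_FLAGS_DIRECT_IO)
                      puts("kernel uses direct I/O on the backing file");
              return 0;
      }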
     
  • The following patch will use dio/aio to submit IO to the backing
    file; after that we no longer need to schedule IO concurrently from
    a workqueue, so use kthread_work to cut the context switch cost a
    lot (sketched below).

    For the non-AIO case, a single thread had been used for a very long
    time; it was only converted to a workqueue in v4.0, which already
    caused a performance regression for Fedora live booting. In the
    discussion [1], even though submitting I/O concurrently via a
    workqueue can improve random read throughput, it may hurt
    sequential read performance at the same time, so it is better to
    restore the single-thread behaviour.

    For the upcoming AIO support, if loop really faces such high
    performance requirements, multiple hw queues with a per-hwq kthread
    would be a better fit than the current workqueue approach.

    [1] http://marc.info/?t=143082678400002&r=1&w=2

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
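
    A sketch of the kthread_work plumbing (shown with the current
    spellings of the kthread_worker API; the 2015-era names put the
    verb first, e.g. init_kthread_worker()):

      /* once per loop device */
      kthread_init_worker(&lo->worker);
      lo->worker_task = kthread_run(kthread_worker_fn, &lo->worker,
                                    "loop%d", lo->lo_number);

      /* per request, instead of queueing onto a workqueue */
      kthread_init_work(&cmd->work, loop_queue_work);
      kthread_queue_work(&lo->worker, &cmd->work);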
     

20 May, 2015

1 commit

  • lo_ctl_mutex is held while running all ioctl handlers, and in some
    of those handlers ioctl_by_bdev(BLKRRPART) is called to reread
    partitions, which requires bd_mutex.

    So failure is easy to trigger, because trylock(bd_mutex) may fail
    inside blkdev_reread_part(). The lock context follows:

    blkid or other application:
        ->open()
            ->mutex_lock(bd_mutex)
                ->lo_open()
                    ->mutex_lock(lo_ctl_mutex)

    losetup (set fd ioctl):
        ->mutex_lock(lo_ctl_mutex)
            ->ioctl_by_bdev(BLKRRPART)
                ->trylock(bd_mutex)

    This patch tries to eliminate the ABBA lock dependency by removing
    lo_ctl_mutex from lo_open(), with the following approach (see the
    sketch after this entry):

    1) make lo_refcnt an atomic_t and avoid acquiring lo_ctl_mutex in
    lo_open():
    - open vs. add/del loop is not a problem, thanks to loop_index_mutex
    - freeze the request queue during clr_fd, so no I/O can arrive
    until clearing the fd has completed, which matches the effect of
    holding lo_ctl_mutex in lo_open()
    - both open() and release() are already serialized by bd_mutex

    2) don't hold lo_ctl_mutex while decreasing/checking lo_refcnt in
    lo_release(); lo_ctl_mutex is then only required for the last
    release.

    Reviewed-by: Christoph Hellwig
    Tested-by: Jarod Wilson
    Acked-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
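
    A sketch of the resulting lockless open path (field names from the
    message; error paths omitted):

      static int lo_open(struct block_device *bdev, fmode_t mode)
      {
              struct loop_device *lo = bdev->bd_disk->private_data;

              /* bd_mutex already serializes open() vs. release(), and
               * loop_index_mutex covers add/del, so a plain atomic
               * reference is enough here; no lo_ctl_mutex needed. */
              atomic_inc(&lo->lo_refcnt);
              return 0;
      }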
     

06 May, 2015

1 commit

  • Documentation/workqueue.txt says:

        If there is dependency among multiple work items used
        during memory reclaim, they should be queued to separate
        wq each with WQ_MEM_RECLAIM.

    Loop devices can be stacked, so we have to convert to a per-device
    workqueue (sketched below). One example is the Fedora live CD.

    Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778
    Cc: stable@vger.kernel.org (v4.0)
    Cc: Justin M. Forbes
    Signed-off-by: Ming Lei
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Ming Lei
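
    A sketch of the per-device allocation (alloc_workqueue() and
    WQ_MEM_RECLAIM are the real APIs; the name format string is my
    assumption):

      lo->wq = alloc_workqueue("kloopd%d", WQ_MEM_RECLAIM, 16,
                               lo->lo_number);
      if (!lo->wq)
              return -ENOMEM;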
     

03 Jan, 2015

2 commits

  • Looks like we pull it in through other ways on x86, but we fail
    on sparc:

        In file included from drivers/block/cryptoloop.c:30:0:
        drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type
          struct blk_mq_tag_set tag_set;

    Add the include to loop.h, kill it from loop.c (see below).

    Signed-off-by: Jens Axboe

    Jens Axboe
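
    The implied one-liner (the header that defines struct
    blk_mq_tag_set):

      /* drivers/block/loop.h */
      #include <linux/blk-mq.h>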
     
  • The conversion is fairly straightforward: a workqueue is used to
    dispatch the loop block device's requests, and the big change is
    that requests are now submitted to the backing file/device
    concurrently from that workqueue, so throughput may improve a lot.
    Since write requests over the same file often run exclusively, they
    are not handled concurrently, to avoid extra context switch cost,
    possible lock contention and work scheduling cost. Also, with
    blk-mq there is an opportunity to get loop I/O merged before it is
    submitted to the backing file/device. A sketch of the queue_rq
    plumbing follows this entry.

    In the following test:
    - base: v3.19-rc2-2041231
    - loop over file in ext4 file system on SSD disk
    - bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1
    - throughput: IOPS

    ------------------------------------------------------
    |           | base  | base with loop-mq | delta  |
    ------------------------------------------------------
    | randread  | 1740  | 25318             | +1355% |
    ------------------------------------------------------
    | read      | 42196 | 51771             | +22.6% |
    ------------------------------------------------------
    | randwrite | 35709 | 34624             | -3%    |
    ------------------------------------------------------
    | write     | 39137 | 40326             | +3%    |
    ------------------------------------------------------

    So loop-mq improves throughput for both read and randread, while
    write and randwrite performance is basically unhurt.

    Another benefit is that the loop driver code gets much simpler
    after the blk-mq conversion, so the patch can be considered a
    cleanup too.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
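
    A sketch of the queue_rq side of such a conversion (shown with
    modern type names; in the v3.19 era the handler returned
    BLK_MQ_RQ_QUEUE_OK, and loop_wq is the driver's workqueue):

      static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
                                        const struct blk_mq_queue_data *bd)
      {
              struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

              blk_mq_start_request(bd->rq);
              /* the actual file IO runs asynchronously in the work item */
              queue_work(loop_wq, &cmd->work);
              return BLK_STS_OK;
      }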
     
